* [PATCH v3 00/15] amd-cppc CPU Performance Scaling Driver
@ 2025-03-06  8:39 Penny Zheng
  2025-03-06  8:39 ` [PATCH v3 01/15] xen/cpufreq: introduces XEN_PM_PSD for solely delivery of _PSD Penny Zheng
                   ` (14 more replies)
  0 siblings, 15 replies; 60+ messages in thread
From: Penny Zheng @ 2025-03-06  8:39 UTC (permalink / raw)
  To: xen-devel
  Cc: ray.huang, Penny Zheng, Jan Beulich, Andrew Cooper,
	Roger Pau Monné, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, Juergen Gross

amd-cppc is the AMD CPU performance scaling driver, which introduces a
new CPU frequency control mechanism for modern AMD APU and CPU series in
Xen. The new mechanism is based on Collaborative Processor Performance
Control (CPPC), which provides finer-grained frequency management than
legacy ACPI hardware P-states. Current AMD CPU/APU platforms use
the ACPI P-states driver to manage CPU frequency and clocks, switching
among only 3 P-states. CPPC replaces the ACPI P-states controls
and provides a flexible, low-latency interface for Xen to communicate
performance hints directly to the hardware.

The amd-cppc driver has 2 operation modes: autonomous (active) mode
and non-autonomous (passive) mode. A different CPUFreq driver is
registered for each mode: "amd-cppc" for passive mode and "amd-cppc-epp"
for active mode.

Passive mode leverages common governors such as *ondemand* and
*performance* to manage the performance hints. Active mode uses the
Energy Performance Preference (EPP) to give the hardware a hint about
whether software wants to bias toward performance (0x0) or energy
efficiency (0xff). The CPPC power algorithm in hardware then automatically
evaluates the runtime workload and adjusts the real-time frequency of the
CPU cores according to power supply, thermal conditions, core voltage and
other hardware conditions.
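
As a rough illustration of the EPP hint, the sketch below maps a preference
name to an EPP byte. Only the two endpoint values (0x0 and 0xff) come from
this cover letter; the helper name `epp_from_name` and the macro names are
assumptions made for the example and are not taken from the series.

```c
#include <stdint.h>
#include <string.h>

/* EPP endpoints quoted in the cover letter. */
#define EPP_PERFORMANCE 0x00u /* bias fully toward performance */
#define EPP_POWERSAVE   0xffu /* bias fully toward energy efficiency */

/*
 * Hypothetical helper: translate a preference name into an EPP byte.
 * Returns 0 on success and -1 for an unknown name.
 */
static int epp_from_name(const char *name, uint8_t *epp)
{
    if ( strcmp(name, "performance") == 0 )
        *epp = EPP_PERFORMANCE;
    else if ( strcmp(name, "powersave") == 0 )
        *epp = EPP_POWERSAVE;
    else
        return -1;

    return 0;
}
```

Values between the two endpoints would express intermediate biases; how
the real driver names or exposes them is not covered here.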

amd-cppc is enabled in passive mode with the top-level `cpufreq=amd-cppc`
option; users add the extra `active` flag to select active mode.
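
The relationship between the option value and the registered driver can be
sketched as follows. The driver names and the two option forms are the ones
given in this cover letter; the functions `amd_cppc_driver_name` and
`parse_amd_cppc_opt` are hypothetical and deliberately ignore any other
sub-options the real parser may accept.

```c
#include <stdbool.h>
#include <string.h>

/* Driver names as given in the cover letter. */
static const char *amd_cppc_driver_name(bool active)
{
    return active ? "amd-cppc-epp" : "amd-cppc";
}

/*
 * Hypothetical parser for the value of "cpufreq=": returns 0 and sets
 * *active when the value is "amd-cppc" or "amd-cppc,active", and -1
 * for anything else.
 */
static int parse_amd_cppc_opt(const char *val, bool *active)
{
    static const char prefix[] = "amd-cppc";
    size_t len = sizeof(prefix) - 1;

    if ( strncmp(val, prefix, len) != 0 )
        return -1;

    if ( val[len] == '\0' )
        *active = false;               /* passive mode is the default */
    else if ( strcmp(val + len, ",active") == 0 )
        *active = true;
    else
        return -1;

    return 0;
}
```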

With `cpufreq=amd-cppc,active`, we ran a 60s sampling test to observe the CPU
frequency change while switching the energy_perf preference from
`xenpm set-cpufreq-cppc powersave` to `xenpm set-cpufreq-cppc performance`.
The outputs are as follows:
```
Setting CPU in powersave mode
Sampling and Outputs:
  Avg freq      2000000 KHz
  Avg freq      2000000 KHz
  Avg freq      2000000 KHz
Setting CPU in performance mode
Sampling and Outputs:
  Avg freq      4640000 KHz
  Avg freq      4220000 KHz
  Avg freq      4640000 KHz
```

Penny Zheng (15):
  xen/cpufreq: introduces XEN_PM_PSD for solely delivery of _PSD
  xen/x86: introduce new sub-hypercall to propagate CPPC data
  xen/cpufreq: refactor cmdline "cpufreq=xxx"
  xen/cpufreq: move XEN_PROCESSOR_PM_xxx to internal header
  xen/x86: introduce "cpufreq=amd-cppc" xen cmdline
  xen/cpufreq: disable px statistic info in amd-cppc mode
  xen/cpufreq: fix core frequency calculation for AMD Family 1Ah CPUs
  xen/amd: export processor max frequency value
  xen/x86: introduce a new amd cppc driver for cpufreq scaling
  xen/cpufreq: only set gov NULL when cpufreq_driver.setpolicy is NULL
  xen/cpufreq: abstract Energy Performance Preference value
  xen/x86: implement EPP support for the amd-cppc driver in active mode
  tools/xenpm: Print CPPC parameters for amd-cppc driver
  xen/xenpm: Adapt cpu frequency monitor in xenpm
  xen/cpufreq: Adapt SET/GET_CPUFREQ_CPPC xen_sysctl_pm_op for amd-cppc
    driver

 docs/misc/xen-command-line.pandoc         |  27 +-
 tools/libs/ctrl/xc_pm.c                   |  12 +-
 tools/misc/xenpm.c                        |  23 +-
 xen/arch/x86/acpi/cpufreq/Makefile        |   1 +
 xen/arch/x86/acpi/cpufreq/acpi.c          |  14 +-
 xen/arch/x86/acpi/cpufreq/amd-cppc.c      | 681 ++++++++++++++++++++++
 xen/arch/x86/acpi/cpufreq/cpufreq.c       |  34 +-
 xen/arch/x86/acpi/cpufreq/hwp.c           |  10 +-
 xen/arch/x86/acpi/cpufreq/powernow.c      |   2 +-
 xen/arch/x86/cpu/amd.c                    |  37 +-
 xen/arch/x86/include/asm/amd.h            |   1 +
 xen/arch/x86/include/asm/msr-index.h      |   5 +
 xen/arch/x86/platform_hypercall.c         |  25 +
 xen/arch/x86/pv/dom0_build.c              |   1 -
 xen/arch/x86/setup.c                      |   1 +
 xen/arch/x86/x86_64/cpufreq.c             |   4 +
 xen/common/domain.c                       |   1 +
 xen/drivers/acpi/pmstat.c                 |  54 +-
 xen/drivers/cpufreq/cpufreq.c             | 258 +++++---
 xen/drivers/cpufreq/cpufreq_ondemand.c    |   2 +-
 xen/drivers/cpufreq/utility.c             |  18 +
 xen/include/acpi/cpufreq/cpufreq.h        |  31 +
 xen/include/acpi/cpufreq/processor_perf.h |  18 +-
 xen/include/public/platform.h             |  38 +-
 xen/include/public/sysctl.h               |   2 +
 xen/include/public/xen.h                  |   1 -
 xen/include/xen/pmstat.h                  |   4 +
 xen/include/xlat.lst                      |   3 +-
 28 files changed, 1160 insertions(+), 148 deletions(-)
 create mode 100644 xen/arch/x86/acpi/cpufreq/amd-cppc.c

-- 
2.34.1




* [PATCH v3 01/15] xen/cpufreq: introduces XEN_PM_PSD for solely delivery of _PSD
  2025-03-06  8:39 [PATCH v3 00/15] amd-cppc CPU Performance Scaling Driver Penny Zheng
@ 2025-03-06  8:39 ` Penny Zheng
  2025-03-24 14:08   ` Jan Beulich
  2025-03-06  8:39 ` [PATCH v3 02/15] xen/x86: introduce new sub-hypercall to propagate CPPC data Penny Zheng
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Penny Zheng @ 2025-03-06  8:39 UTC (permalink / raw)
  To: xen-devel
  Cc: ray.huang, Penny Zheng, Jan Beulich, Andrew Cooper,
	Roger Pau Monné, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini

_PSD (P-State Dependency) provides OSPM with logical processor dependency
information for performance control, whether legacy P-states or CPPC is used.

In order to re-use it for CPPC, this commit extracts the delivery of _PSD info
from set_px_pminfo() and wraps it in a new sub-hypercall, XEN_PM_PSD.
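
One consequence visible in the diff is an ordering requirement: the
per-CPU entry is now allocated only by set_psd_pminfo(), and
set_px_pminfo() returns -EINVAL if it is missing. The following is a
simplified stand-alone model of that contract, not the Xen code itself;
`pminfo_sketch`, `sketch_set_psd`, `sketch_set_px` and `NR_CPUS_SKETCH`
are hypothetical names, and the shared_type validation is omitted.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define NR_CPUS_SKETCH 4   /* hypothetical table size for the sketch */

struct pminfo_sketch {
    uint32_t acpi_id;
    uint32_t shared_type;
    int has_px;
};

static struct pminfo_sketch *pminfo[NR_CPUS_SKETCH];

/* Models set_psd_pminfo(): the only place that allocates the entry. */
static int sketch_set_psd(unsigned int cpu, uint32_t acpi_id,
                          uint32_t shared_type)
{
    if ( cpu >= NR_CPUS_SKETCH )
        return -EINVAL;

    if ( !pminfo[cpu] )
    {
        pminfo[cpu] = calloc(1, sizeof(*pminfo[cpu]));
        if ( !pminfo[cpu] )
            return -ENOMEM;
    }
    pminfo[cpu]->acpi_id = acpi_id;
    pminfo[cpu]->shared_type = shared_type;

    return 0;
}

/* Models set_px_pminfo(): fails unless _PSD was delivered first. */
static int sketch_set_px(unsigned int cpu)
{
    if ( cpu >= NR_CPUS_SKETCH || !pminfo[cpu] )
        return -EINVAL;

    pminfo[cpu]->has_px = 1;

    return 0;
}
```

In this model a caller has to deliver _PSD data before Px data for the
same CPU, which is the behaviour the reworked set_px_pminfo() implies.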

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
---
v2 -> v3:
- new commit
---
 xen/arch/x86/acpi/cpufreq/acpi.c          |   2 +-
 xen/arch/x86/acpi/cpufreq/powernow.c      |   2 +-
 xen/arch/x86/platform_hypercall.c         |  11 ++
 xen/arch/x86/x86_64/cpufreq.c             |   2 +
 xen/drivers/cpufreq/cpufreq.c             | 122 +++++++++++++---------
 xen/drivers/cpufreq/cpufreq_ondemand.c    |   2 +-
 xen/include/acpi/cpufreq/processor_perf.h |   4 +-
 xen/include/public/platform.h             |  17 +--
 xen/include/xen/pmstat.h                  |   2 +
 xen/include/xlat.lst                      |   2 +-
 10 files changed, 103 insertions(+), 63 deletions(-)

diff --git a/xen/arch/x86/acpi/cpufreq/acpi.c b/xen/arch/x86/acpi/cpufreq/acpi.c
index 0c25376406..0cf94ab2d6 100644
--- a/xen/arch/x86/acpi/cpufreq/acpi.c
+++ b/xen/arch/x86/acpi/cpufreq/acpi.c
@@ -393,7 +393,7 @@ static int cf_check acpi_cpufreq_cpu_init(struct cpufreq_policy *policy)
     data->acpi_data = &processor_pminfo[cpu]->perf;
 
     perf = data->acpi_data;
-    policy->shared_type = perf->shared_type;
+    policy->shared_type = processor_pminfo[cpu]->shared_type;
 
     switch (perf->control_register.space_id) {
     case ACPI_ADR_SPACE_SYSTEM_IO:
diff --git a/xen/arch/x86/acpi/cpufreq/powernow.c b/xen/arch/x86/acpi/cpufreq/powernow.c
index 69364e1855..69ad403fc1 100644
--- a/xen/arch/x86/acpi/cpufreq/powernow.c
+++ b/xen/arch/x86/acpi/cpufreq/powernow.c
@@ -218,7 +218,7 @@ static int cf_check powernow_cpufreq_cpu_init(struct cpufreq_policy *policy)
     data->acpi_data = &processor_pminfo[cpu]->perf;
 
     info.perf = perf = data->acpi_data;
-    policy->shared_type = perf->shared_type;
+    policy->shared_type = processor_pminfo[cpu]->shared_type;
 
     if (policy->shared_type == CPUFREQ_SHARED_TYPE_ALL ||
         policy->shared_type == CPUFREQ_SHARED_TYPE_ANY) {
diff --git a/xen/arch/x86/platform_hypercall.c b/xen/arch/x86/platform_hypercall.c
index 90abd3197f..b0d98b5840 100644
--- a/xen/arch/x86/platform_hypercall.c
+++ b/xen/arch/x86/platform_hypercall.c
@@ -571,6 +571,17 @@ ret_t do_platform_op(
             ret = acpi_set_pdc_bits(op->u.set_pminfo.id, pdc);
             break;
         }
+        case XEN_PM_PSD:
+            if ( !(xen_processor_pmbits & XEN_PROCESSOR_PM_PX) )
+            {
+                ret = -EOPNOTSUPP;
+                break;
+            }
+
+            ret = set_psd_pminfo(op->u.set_pminfo.id,
+                                 op->u.set_pminfo.shared_type,
+                                 &op->u.set_pminfo.u.domain_info);
+            break;
 
         default:
             ret = -EINVAL;
diff --git a/xen/arch/x86/x86_64/cpufreq.c b/xen/arch/x86/x86_64/cpufreq.c
index e4f3d5b436..d1b93b8eef 100644
--- a/xen/arch/x86/x86_64/cpufreq.c
+++ b/xen/arch/x86/x86_64/cpufreq.c
@@ -28,6 +28,8 @@
 
 CHECK_processor_px;
 
+CHECK_psd_package;
+
 DEFINE_XEN_GUEST_HANDLE(compat_processor_px_t);
 
 int compat_set_px_pminfo(uint32_t acpi_id,
diff --git a/xen/drivers/cpufreq/cpufreq.c b/xen/drivers/cpufreq/cpufreq.c
index 4a103c6de9..638476ca15 100644
--- a/xen/drivers/cpufreq/cpufreq.c
+++ b/xen/drivers/cpufreq/cpufreq.c
@@ -36,6 +36,7 @@
 #include <xen/string.h>
 #include <xen/timer.h>
 #include <xen/xmalloc.h>
+#include <xen/xvmalloc.h>
 #include <xen/guest_access.h>
 #include <xen/domain.h>
 #include <xen/cpu.h>
@@ -201,15 +202,15 @@ int cpufreq_add_cpu(unsigned int cpu)
     struct cpufreq_dom *cpufreq_dom = NULL;
     struct cpufreq_policy new_policy;
     struct cpufreq_policy *policy;
-    struct processor_performance *perf;
+    struct processor_pminfo *pmpt;
 
     /* to protect the case when Px was not controlled by xen */
     if ( !processor_pminfo[cpu] || !cpu_online(cpu) )
         return -EINVAL;
 
-    perf = &processor_pminfo[cpu]->perf;
+    pmpt = processor_pminfo[cpu];
 
-    if ( !(perf->init & XEN_PX_INIT) )
+    if ( !(pmpt->perf.init & XEN_PX_INIT) )
         return -EINVAL;
 
     if (!cpufreq_driver.init)
@@ -218,10 +219,10 @@ int cpufreq_add_cpu(unsigned int cpu)
     if (per_cpu(cpufreq_cpu_policy, cpu))
         return 0;
 
-    if (perf->shared_type == CPUFREQ_SHARED_TYPE_HW)
+    if (pmpt->shared_type == CPUFREQ_SHARED_TYPE_HW)
         hw_all = 1;
 
-    dom = perf->domain_info.domain;
+    dom = pmpt->domain_info.domain;
 
     list_for_each(pos, &cpufreq_dom_list_head) {
         cpufreq_dom = list_entry(pos, struct cpufreq_dom, node);
@@ -246,18 +247,18 @@ int cpufreq_add_cpu(unsigned int cpu)
     } else {
         /* domain sanity check under whatever coordination type */
         firstcpu = cpumask_first(cpufreq_dom->map);
-        if ((perf->domain_info.coord_type !=
-            processor_pminfo[firstcpu]->perf.domain_info.coord_type) ||
-            (perf->domain_info.num_processors !=
-            processor_pminfo[firstcpu]->perf.domain_info.num_processors)) {
+        if ((pmpt->domain_info.coord_type !=
+            processor_pminfo[firstcpu]->domain_info.coord_type) ||
+            (pmpt->domain_info.num_processors !=
+            processor_pminfo[firstcpu]->domain_info.num_processors)) {
 
             printk(KERN_WARNING "cpufreq fail to add CPU%d:"
                    "incorrect _PSD(%"PRIu64":%"PRIu64"), "
                    "expect(%"PRIu64"/%"PRIu64")\n",
-                   cpu, perf->domain_info.coord_type,
-                   perf->domain_info.num_processors,
-                   processor_pminfo[firstcpu]->perf.domain_info.coord_type,
-                   processor_pminfo[firstcpu]->perf.domain_info.num_processors
+                   cpu, pmpt->domain_info.coord_type,
+                   pmpt->domain_info.num_processors,
+                   processor_pminfo[firstcpu]->domain_info.coord_type,
+                   processor_pminfo[firstcpu]->domain_info.num_processors
                 );
             return -EINVAL;
         }
@@ -305,7 +306,7 @@ int cpufreq_add_cpu(unsigned int cpu)
         goto err1;
 
     if (hw_all || (cpumask_weight(cpufreq_dom->map) ==
-                   perf->domain_info.num_processors)) {
+                   pmpt->domain_info.num_processors)) {
         memcpy(&new_policy, policy, sizeof(struct cpufreq_policy));
         policy->governor = NULL;
 
@@ -359,24 +360,24 @@ int cpufreq_del_cpu(unsigned int cpu)
     struct list_head *pos;
     struct cpufreq_dom *cpufreq_dom = NULL;
     struct cpufreq_policy *policy;
-    struct processor_performance *perf;
+    struct processor_pminfo *pmpt;
 
     /* to protect the case when Px was not controlled by xen */
     if ( !processor_pminfo[cpu] || !cpu_online(cpu) )
         return -EINVAL;
 
-    perf = &processor_pminfo[cpu]->perf;
+    pmpt = processor_pminfo[cpu];
 
-    if ( !(perf->init & XEN_PX_INIT) )
+    if ( !(pmpt->perf.init & XEN_PX_INIT) )
         return -EINVAL;
 
     if (!per_cpu(cpufreq_cpu_policy, cpu))
         return 0;
 
-    if (perf->shared_type == CPUFREQ_SHARED_TYPE_HW)
+    if (pmpt->shared_type == CPUFREQ_SHARED_TYPE_HW)
         hw_all = 1;
 
-    dom = perf->domain_info.domain;
+    dom = pmpt->domain_info.domain;
     policy = per_cpu(cpufreq_cpu_policy, cpu);
 
     list_for_each(pos, &cpufreq_dom_list_head) {
@@ -393,7 +394,7 @@ int cpufreq_del_cpu(unsigned int cpu)
     /* for HW_ALL, stop gov for each core of the _PSD domain */
     /* for SW_ALL & SW_ANY, stop gov for the 1st core of the _PSD domain */
     if (hw_all || (cpumask_weight(cpufreq_dom->map) ==
-                   perf->domain_info.num_processors))
+                   pmpt->domain_info.num_processors))
         __cpufreq_governor(policy, CPUFREQ_GOV_STOP);
 
     cpufreq_statistic_exit(cpu);
@@ -475,19 +476,13 @@ int set_px_pminfo(uint32_t acpi_id, struct xen_processor_performance *perf)
                acpi_id, cpu);
 
     pmpt = processor_pminfo[cpu];
+    /* Must already have been allocated in set_psd_pminfo() */
     if ( !pmpt )
     {
-        pmpt = xzalloc(struct processor_pminfo);
-        if ( !pmpt )
-        {
-            ret = -ENOMEM;
-            goto out;
-        }
-        processor_pminfo[cpu] = pmpt;
+        ret = -EINVAL;
+        goto out;
     }
     pxpt = &pmpt->perf;
-    pmpt->acpi_id = acpi_id;
-    pmpt->id = cpu;
 
     if ( perf->flags & XEN_PX_PCT )
     {
@@ -537,25 +532,6 @@ int set_px_pminfo(uint32_t acpi_id, struct xen_processor_performance *perf)
             print_PSS(pxpt->states,pxpt->state_count);
     }
 
-    if ( perf->flags & XEN_PX_PSD )
-    {
-        /* check domain coordination */
-        if ( perf->shared_type != CPUFREQ_SHARED_TYPE_ALL &&
-             perf->shared_type != CPUFREQ_SHARED_TYPE_ANY &&
-             perf->shared_type != CPUFREQ_SHARED_TYPE_HW )
-        {
-            ret = -EINVAL;
-            goto out;
-        }
-
-        pxpt->shared_type = perf->shared_type;
-        memcpy(&pxpt->domain_info, &perf->domain_info,
-               sizeof(struct xen_psd_package));
-
-        if ( cpufreq_verbose )
-            print_PSD(&pxpt->domain_info);
-    }
-
     if ( perf->flags & XEN_PX_PPC )
     {
         pxpt->platform_limit = perf->platform_limit;
@@ -570,7 +546,7 @@ int set_px_pminfo(uint32_t acpi_id, struct xen_processor_performance *perf)
         }
     }
 
-    if ( perf->flags == ( XEN_PX_PCT | XEN_PX_PSS | XEN_PX_PSD | XEN_PX_PPC ) )
+    if ( perf->flags == ( XEN_PX_PCT | XEN_PX_PSS | XEN_PX_PPC ) )
     {
         pxpt->init = XEN_PX_INIT;
 
@@ -582,6 +558,54 @@ out:
     return ret;
 }
 
+int set_psd_pminfo(uint32_t acpi_id, uint32_t shared_type,
+                   const struct xen_psd_package *psd_data)
+{
+    int ret = 0, cpuid;
+    struct processor_pminfo *pm_info;
+
+    cpuid = get_cpu_id(acpi_id);
+    if ( cpuid < 0 || !psd_data )
+    {
+        ret = -EINVAL;
+        goto out;
+    }
+
+    /* check domain coordination */
+    if ( shared_type != CPUFREQ_SHARED_TYPE_ALL &&
+         shared_type != CPUFREQ_SHARED_TYPE_ANY &&
+         shared_type != CPUFREQ_SHARED_TYPE_HW )
+    {
+        ret = -EINVAL;
+        goto out;
+    }
+    if ( cpufreq_verbose )
+        printk("Set CPU acpi_id(%d) cpuid(%d) _PSD State info:\n",
+               acpi_id, cpuid);
+
+    pm_info = processor_pminfo[cpuid];
+    if ( !pm_info )
+    {
+        pm_info = xvzalloc(struct processor_pminfo);
+        if ( !pm_info )
+        {
+            ret = -ENOMEM;
+            goto out;
+        }
+        processor_pminfo[cpuid] = pm_info;
+    }
+    pm_info->acpi_id = acpi_id;
+    pm_info->id = cpuid;
+    pm_info->shared_type = shared_type;
+    pm_info->domain_info = *psd_data;
+
+    if ( cpufreq_verbose )
+        print_PSD(&pm_info->domain_info);
+
+ out:
+    return ret;
+}
+
 static void cpufreq_cmdline_common_para(struct cpufreq_policy *new_policy)
 {
     if (usr_max_freq)
diff --git a/xen/drivers/cpufreq/cpufreq_ondemand.c b/xen/drivers/cpufreq/cpufreq_ondemand.c
index 06cfc88d30..5b23daaac1 100644
--- a/xen/drivers/cpufreq/cpufreq_ondemand.c
+++ b/xen/drivers/cpufreq/cpufreq_ondemand.c
@@ -194,7 +194,7 @@ static void dbs_timer_init(struct cpu_dbs_info_s *dbs_info)
 
     set_timer(&per_cpu(dbs_timer, dbs_info->cpu), NOW()+dbs_tuners_ins.sampling_rate);
 
-    if ( processor_pminfo[dbs_info->cpu]->perf.shared_type
+    if ( processor_pminfo[dbs_info->cpu]->shared_type
             == CPUFREQ_SHARED_TYPE_HW )
     {
         dbs_info->stoppable = 1;
diff --git a/xen/include/acpi/cpufreq/processor_perf.h b/xen/include/acpi/cpufreq/processor_perf.h
index 301104e16f..19f5de6b08 100644
--- a/xen/include/acpi/cpufreq/processor_perf.h
+++ b/xen/include/acpi/cpufreq/processor_perf.h
@@ -27,8 +27,6 @@ struct processor_performance {
     struct xen_pct_register status_register;
     uint32_t state_count;
     struct xen_processor_px *states;
-    struct xen_psd_package domain_info;
-    uint32_t shared_type;
 
     uint32_t init;
 };
@@ -36,6 +34,8 @@ struct processor_performance {
 struct processor_pminfo {
     uint32_t acpi_id;
     uint32_t id;
+    struct xen_psd_package domain_info;
+    uint32_t shared_type;
     struct processor_performance    perf;
 };
 
diff --git a/xen/include/public/platform.h b/xen/include/public/platform.h
index 2725b8d104..f5c50380cb 100644
--- a/xen/include/public/platform.h
+++ b/xen/include/public/platform.h
@@ -363,12 +363,12 @@ DEFINE_XEN_GUEST_HANDLE(xenpf_getidletime_t);
 #define XEN_PM_PX   1
 #define XEN_PM_TX   2
 #define XEN_PM_PDC  3
+#define XEN_PM_PSD  4
 
 /* Px sub info type */
 #define XEN_PX_PCT   1
 #define XEN_PX_PSS   2
 #define XEN_PX_PPC   4
-#define XEN_PX_PSD   8
 
 struct xen_power_register {
     uint32_t     space_id;
@@ -439,6 +439,7 @@ struct xen_psd_package {
     uint64_t coord_type;
     uint64_t num_processors;
 };
+typedef struct xen_psd_package xen_psd_package_t;
 
 struct xen_processor_performance {
     uint32_t flags;     /* flag for Px sub info type */
@@ -447,12 +448,6 @@ struct xen_processor_performance {
     struct xen_pct_register status_register;
     uint32_t state_count;     /* total available performance states */
     XEN_GUEST_HANDLE(xen_processor_px_t) states;
-    struct xen_psd_package domain_info;
-    /* Coordination type of this processor */
-#define XEN_CPUPERF_SHARED_TYPE_HW   1 /* HW does needed coordination */
-#define XEN_CPUPERF_SHARED_TYPE_ALL  2 /* All dependent CPUs should set freq */
-#define XEN_CPUPERF_SHARED_TYPE_ANY  3 /* Freq can be set from any dependent CPU */
-    uint32_t shared_type;
 };
 typedef struct xen_processor_performance xen_processor_performance_t;
 DEFINE_XEN_GUEST_HANDLE(xen_processor_performance_t);
@@ -463,9 +458,15 @@ struct xenpf_set_processor_pminfo {
     uint32_t type;  /* {XEN_PM_CX, XEN_PM_PX} */
     union {
         struct xen_processor_power          power;/* Cx: _CST/_CSD */
-        struct xen_processor_performance    perf; /* Px: _PPC/_PCT/_PSS/_PSD */
+        xen_psd_package_t                   domain_info; /* _PSD */
+        struct xen_processor_performance    perf; /* Px: _PPC/_PCT/_PSS/ */
         XEN_GUEST_HANDLE(uint32)            pdc;  /* _PDC */
     } u;
+    /* Coordination type of this processor */
+#define XEN_CPUPERF_SHARED_TYPE_HW   1 /* HW does needed coordination */
+#define XEN_CPUPERF_SHARED_TYPE_ALL  2 /* All dependent CPUs should set freq */
+#define XEN_CPUPERF_SHARED_TYPE_ANY  3 /* Freq can be set from any dependent CPU */
+    uint32_t shared_type;
 };
 typedef struct xenpf_set_processor_pminfo xenpf_set_processor_pminfo_t;
 DEFINE_XEN_GUEST_HANDLE(xenpf_set_processor_pminfo_t);
diff --git a/xen/include/xen/pmstat.h b/xen/include/xen/pmstat.h
index 8350403e95..fd02316ce9 100644
--- a/xen/include/xen/pmstat.h
+++ b/xen/include/xen/pmstat.h
@@ -5,6 +5,8 @@
 #include <public/platform.h> /* for struct xen_processor_power */
 #include <public/sysctl.h>   /* for struct pm_cx_stat */
 
+int set_psd_pminfo(uint32_t acpi_id, uint32_t shared_type,
+                   const struct xen_psd_package *psd_data);
 int set_px_pminfo(uint32_t acpi_id, struct xen_processor_performance *perf);
 long set_cx_pminfo(uint32_t acpi_id, struct xen_processor_power *power);
 
diff --git a/xen/include/xlat.lst b/xen/include/xlat.lst
index 3c7b6c6830..0d964fe0ce 100644
--- a/xen/include/xlat.lst
+++ b/xen/include/xlat.lst
@@ -168,7 +168,7 @@
 !	processor_performance		platform.h
 !	processor_power			platform.h
 ?	processor_px			platform.h
-!	psd_package			platform.h
+?	psd_package			platform.h
 ?	xenpf_enter_acpi_sleep		platform.h
 ?	xenpf_pcpu_version		platform.h
 ?	xenpf_pcpuinfo			platform.h
-- 
2.34.1




* [PATCH v3 02/15] xen/x86: introduce new sub-hypercall to propagate CPPC data
  2025-03-06  8:39 [PATCH v3 00/15] amd-cppc CPU Performance Scaling Driver Penny Zheng
  2025-03-06  8:39 ` [PATCH v3 01/15] xen/cpufreq: introduces XEN_PM_PSD for solely delivery of _PSD Penny Zheng
@ 2025-03-06  8:39 ` Penny Zheng
  2025-03-24 14:28   ` Jan Beulich
  2025-03-06  8:39 ` [PATCH v3 03/15] xen/cpufreq: refactor cmdline "cpufreq=xxx" Penny Zheng
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Penny Zheng @ 2025-03-06  8:39 UTC (permalink / raw)
  To: xen-devel
  Cc: ray.huang, Penny Zheng, Jan Beulich, Andrew Cooper,
	Roger Pau Monné, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini

In order to provide backward compatibility with existing governors
that represent performance as frequencies, like ondemand, the _CPC
table can optionally provide processor frequency range values, Lowest
Frequency and Nominal Frequency, to let the OS use Lowest Frequency/
Performance and Nominal Frequency/Performance as anchor points to
create a linear mapping of CPPC abstract performance to CPU frequency.
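
A minimal sketch of such a linear mapping is shown below. The field names
follow struct xen_processor_cppc from this patch, but `cppc_anchors` and
`perf_to_mhz` are hypothetical, and the conversion actually used by the
driver later in the series may differ (for example in rounding or in how
missing frequency values are handled).

```c
#include <stdint.h>

/* Field names follow struct xen_processor_cppc from this patch. */
struct cppc_anchors {
    uint32_t nominal_perf;
    uint32_t lowest_perf;
    uint32_t lowest_mhz;
    uint32_t nominal_mhz;
};

/*
 * Map an abstract performance value to MHz on the line through
 * (lowest_perf, lowest_mhz) and (nominal_perf, nominal_mhz).
 * Returns 0 when the anchors cannot define a line.
 */
static uint32_t perf_to_mhz(const struct cppc_anchors *c, uint32_t perf)
{
    int64_t dperf = (int64_t)c->nominal_perf - c->lowest_perf;
    int64_t dmhz = (int64_t)c->nominal_mhz - c->lowest_mhz;
    int64_t mhz;

    if ( dperf <= 0 )
        return 0;

    /* Signed arithmetic so values above nominal extrapolate correctly. */
    mhz = c->lowest_mhz + ((int64_t)perf - c->lowest_perf) * dmhz / dperf;

    return mhz > 0 ? (uint32_t)mhz : 0;
}
```

With anchors (20, 400 MHz) and (120, 2400 MHz), for instance, each
performance step corresponds to 20 MHz, so a performance value of 70
maps to 1400 MHz.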

As Xen is incapable of parsing ACPI dynamic tables, this commit
introduces a new sub-hypercall to propagate the required CPPC data from
the dom0 kernel.

If the platform supports CPPC, the _CPC object must exist under all
processor objects. That is, Xen is not expected to support mixed-mode
(CPPC & legacy _PSS, _PCT, _PPC) operation; it runs in either advanced
CPPC mode or legacy P-states mode.

This commit also introduces a new flag XEN_PM_CPPC to reflect that a
processor has been initialised in CPPC mode.
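
In the diff of this patch, the per-processor `init` field moves into
struct processor_pminfo and XEN_CPPC_INIT is defined next to XEN_PX_INIT,
so common code can accept either initialisation path. The check can be
sketched in isolation as follows; the two values are copied from the
patch, while `pm_data_ready` is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in processor_perf.h by this patch. */
#define XEN_CPPC_INIT 0x40000000U
#define XEN_PX_INIT   0x80000000U

/*
 * Stand-in for testing the "init" field of struct processor_pminfo:
 * true once either Px data or CPPC data has been delivered.
 */
static bool pm_data_ready(uint32_t init)
{
    return (init & (XEN_PX_INIT | XEN_CPPC_INIT)) != 0;
}
```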

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
---
v1 -> v2:
- Remove unnecessary curly braces
- Pointer-to-const for print_CPPC and set_cppc_pminfo
- Structure allocation shall use xvzalloc()
- Unnecessary memcpy(), and change it to a (type safe) structure assignment
- Add comment for struct xen_processor_cppc, and keep the chosen fields
in the order _CPC has them
- Obey alphabetic sorting, and prefix compat structures with ? instead
of !
---
v2 -> v3:
- Trim too long line
- Re-place set_cppc_pminfo() past set_px_pminfo()
- Fix Misra violations: Declaration and definition ought to agree
in parameter names
- Introduce a new flag XEN_PM_CPPC to reflect processor initialised in CPPC
mode
---
 xen/arch/x86/platform_hypercall.c         |  5 +++
 xen/arch/x86/x86_64/cpufreq.c             |  2 +
 xen/drivers/acpi/pmstat.c                 |  4 +-
 xen/drivers/cpufreq/cpufreq.c             | 53 +++++++++++++++++++++--
 xen/include/acpi/cpufreq/processor_perf.h |  8 ++--
 xen/include/public/platform.h             | 16 +++++++
 xen/include/xen/pmstat.h                  |  2 +
 xen/include/xlat.lst                      |  1 +
 8 files changed, 82 insertions(+), 9 deletions(-)

diff --git a/xen/arch/x86/platform_hypercall.c b/xen/arch/x86/platform_hypercall.c
index b0d98b5840..77390a0dbd 100644
--- a/xen/arch/x86/platform_hypercall.c
+++ b/xen/arch/x86/platform_hypercall.c
@@ -583,6 +583,11 @@ ret_t do_platform_op(
                                  &op->u.set_pminfo.u.domain_info);
             break;
 
+        case XEN_PM_CPPC:
+            ret = set_cppc_pminfo(op->u.set_pminfo.id,
+                                  &op->u.set_pminfo.u.cppc_data);
+            break;
+
         default:
             ret = -EINVAL;
             break;
diff --git a/xen/arch/x86/x86_64/cpufreq.c b/xen/arch/x86/x86_64/cpufreq.c
index d1b93b8eef..565e4f8652 100644
--- a/xen/arch/x86/x86_64/cpufreq.c
+++ b/xen/arch/x86/x86_64/cpufreq.c
@@ -26,6 +26,8 @@
 #include <xen/pmstat.h>
 #include <compat/platform.h>
 
+CHECK_processor_cppc;
+
 CHECK_processor_px;
 
 CHECK_psd_package;
diff --git a/xen/drivers/acpi/pmstat.c b/xen/drivers/acpi/pmstat.c
index df309e27b4..c8e00766a6 100644
--- a/xen/drivers/acpi/pmstat.c
+++ b/xen/drivers/acpi/pmstat.c
@@ -68,7 +68,7 @@ int do_get_pm_info(struct xen_sysctl_get_pmstat *op)
             return -ENODEV;
         if ( hwp_active() )
             return -EOPNOTSUPP;
-        if ( !pmpt || !(pmpt->perf.init & XEN_PX_INIT) )
+        if ( !pmpt || !(pmpt->init & XEN_PX_INIT) )
             return -EINVAL;
         break;
     default:
@@ -467,7 +467,7 @@ int do_pm_op(struct xen_sysctl_pm_op *op)
     case CPUFREQ_PARA:
         if ( !(xen_processor_pmbits & XEN_PROCESSOR_PM_PX) )
             return -ENODEV;
-        if ( !pmpt || !(pmpt->perf.init & XEN_PX_INIT) )
+        if ( !pmpt || !(pmpt->init & (XEN_PX_INIT | XEN_CPPC_INIT)) )
             return -EINVAL;
         break;
     }
diff --git a/xen/drivers/cpufreq/cpufreq.c b/xen/drivers/cpufreq/cpufreq.c
index 638476ca15..894bafebaa 100644
--- a/xen/drivers/cpufreq/cpufreq.c
+++ b/xen/drivers/cpufreq/cpufreq.c
@@ -210,7 +210,7 @@ int cpufreq_add_cpu(unsigned int cpu)
 
     pmpt = processor_pminfo[cpu];
 
-    if ( !(pmpt->perf.init & XEN_PX_INIT) )
+    if ( !(pmpt->init & (XEN_PX_INIT | XEN_CPPC_INIT)) )
         return -EINVAL;
 
     if (!cpufreq_driver.init)
@@ -368,7 +368,7 @@ int cpufreq_del_cpu(unsigned int cpu)
 
     pmpt = processor_pminfo[cpu];
 
-    if ( !(pmpt->perf.init & XEN_PX_INIT) )
+    if ( !(pmpt->init & (XEN_PX_INIT | XEN_CPPC_INIT)) )
         return -EINVAL;
 
     if (!per_cpu(cpufreq_cpu_policy, cpu))
@@ -459,6 +459,16 @@ static void print_PPC(unsigned int platform_limit)
     printk("\t_PPC: %d\n", platform_limit);
 }
 
+static void print_CPPC(const struct xen_processor_cppc *cppc_data)
+{
+    printk("\t_CPC: highest_perf=%u, lowest_perf=%u, "
+           "nominal_perf=%u, lowest_nonlinear_perf=%u, "
+           "nominal_mhz=%uMHz, lowest_mhz=%uMHz\n",
+           cppc_data->highest_perf, cppc_data->lowest_perf,
+           cppc_data->nominal_perf, cppc_data->lowest_nonlinear_perf,
+           cppc_data->nominal_mhz, cppc_data->lowest_mhz);
+}
+
 int set_px_pminfo(uint32_t acpi_id, struct xen_processor_performance *perf)
 {
     int ret = 0, cpu;
@@ -539,7 +549,7 @@ int set_px_pminfo(uint32_t acpi_id, struct xen_processor_performance *perf)
         if ( cpufreq_verbose )
             print_PPC(pxpt->platform_limit);
 
-        if ( pxpt->init == XEN_PX_INIT )
+        if ( pmpt->init == XEN_PX_INIT )
         {
             ret = cpufreq_limit_change(cpu);
             goto out;
@@ -548,7 +558,7 @@ int set_px_pminfo(uint32_t acpi_id, struct xen_processor_performance *perf)
 
     if ( perf->flags == ( XEN_PX_PCT | XEN_PX_PSS | XEN_PX_PPC ) )
     {
-        pxpt->init = XEN_PX_INIT;
+        pmpt->init = XEN_PX_INIT;
 
         ret = cpufreq_cpu_init(cpu);
         goto out;
@@ -606,6 +616,41 @@ int set_psd_pminfo(uint32_t acpi_id, uint32_t shared_type,
     return ret;
 }
 
+int set_cppc_pminfo(uint32_t acpi_id,
+                    const struct xen_processor_cppc *cppc_data)
+{
+    int ret = 0, cpuid;
+    struct processor_pminfo *pm_info;
+
+    cpuid = get_cpu_id(acpi_id);
+    if ( cpuid < 0 || !cppc_data )
+    {
+        ret = -EINVAL;
+        goto out;
+    }
+    if ( cpufreq_verbose )
+        printk("Set CPU acpi_id(%d) cpuid(%d) CPPC State info:\n",
+               acpi_id, cpuid);
+
+    pm_info = processor_pminfo[cpuid];
+    /* Must already have been allocated in set_psd_pminfo() */
+    if ( !pm_info )
+    {
+        ret = -EINVAL;
+        goto out;
+    }
+    pm_info->cppc_data = *cppc_data;
+
+    if ( cpufreq_verbose )
+        print_CPPC(&pm_info->cppc_data);
+
+    pm_info->init = XEN_CPPC_INIT;
+    ret = cpufreq_cpu_init(cpuid);
+
+ out:
+    return ret;
+}
+
 static void cpufreq_cmdline_common_para(struct cpufreq_policy *new_policy)
 {
     if (usr_max_freq)
diff --git a/xen/include/acpi/cpufreq/processor_perf.h b/xen/include/acpi/cpufreq/processor_perf.h
index 19f5de6b08..12b6e6b826 100644
--- a/xen/include/acpi/cpufreq/processor_perf.h
+++ b/xen/include/acpi/cpufreq/processor_perf.h
@@ -5,7 +5,8 @@
 #include <public/sysctl.h>
 #include <xen/acpi.h>
 
-#define XEN_PX_INIT 0x80000000U
+#define XEN_CPPC_INIT 0x40000000U
+#define XEN_PX_INIT   0x80000000U
 
 unsigned int powernow_register_driver(void);
 unsigned int get_measured_perf(unsigned int cpu, unsigned int flag);
@@ -27,8 +28,6 @@ struct processor_performance {
     struct xen_pct_register status_register;
     uint32_t state_count;
     struct xen_processor_px *states;
-
-    uint32_t init;
 };
 
 struct processor_pminfo {
@@ -37,6 +36,9 @@ struct processor_pminfo {
     struct xen_psd_package domain_info;
     uint32_t shared_type;
     struct processor_performance    perf;
+    struct xen_processor_cppc cppc_data;
+
+    uint32_t init;
 };
 
 extern struct processor_pminfo *processor_pminfo[NR_CPUS];
diff --git a/xen/include/public/platform.h b/xen/include/public/platform.h
index f5c50380cb..07f4b72014 100644
--- a/xen/include/public/platform.h
+++ b/xen/include/public/platform.h
@@ -364,6 +364,7 @@ DEFINE_XEN_GUEST_HANDLE(xenpf_getidletime_t);
 #define XEN_PM_TX   2
 #define XEN_PM_PDC  3
 #define XEN_PM_PSD  4
+#define XEN_PM_CPPC 5
 
 /* Px sub info type */
 #define XEN_PX_PCT   1
@@ -432,6 +433,20 @@ struct xen_processor_px {
 typedef struct xen_processor_px xen_processor_px_t;
 DEFINE_XEN_GUEST_HANDLE(xen_processor_px_t);
 
+/*
+ * Subset _CPC fields useful for CPPC-compatible cpufreq
+ * driver's initialization
+ */
+struct xen_processor_cppc {
+    uint32_t highest_perf;
+    uint32_t nominal_perf;
+    uint32_t lowest_nonlinear_perf;
+    uint32_t lowest_perf;
+    uint32_t lowest_mhz;
+    uint32_t nominal_mhz;
+};
+typedef struct xen_processor_cppc xen_processor_cppc_t;
+
 struct xen_psd_package {
     uint64_t num_entries;
     uint64_t revision;
@@ -461,6 +476,7 @@ struct xenpf_set_processor_pminfo {
         xen_psd_package_t                   domain_info; /* _PSD */
         struct xen_processor_performance    perf; /* Px: _PPC/_PCT/_PSS/ */
         XEN_GUEST_HANDLE(uint32)            pdc;  /* _PDC */
+        xen_processor_cppc_t                cppc_data; /*_CPC */
     } u;
     /* Coordination type of this processor */
 #define XEN_CPUPERF_SHARED_TYPE_HW   1 /* HW does needed coordination */
diff --git a/xen/include/xen/pmstat.h b/xen/include/xen/pmstat.h
index fd02316ce9..c223f417fd 100644
--- a/xen/include/xen/pmstat.h
+++ b/xen/include/xen/pmstat.h
@@ -8,6 +8,8 @@
 int set_psd_pminfo(uint32_t acpi_id, uint32_t shared_type,
                    const struct xen_psd_package *psd_data);
 int set_px_pminfo(uint32_t acpi_id, struct xen_processor_performance *perf);
+int set_cppc_pminfo(uint32_t acpi_id,
+                    const struct xen_processor_cppc *cppc_data);
 long set_cx_pminfo(uint32_t acpi_id, struct xen_processor_power *power);
 
 #ifdef CONFIG_COMPAT
diff --git a/xen/include/xlat.lst b/xen/include/xlat.lst
index 0d964fe0ce..3f47552a22 100644
--- a/xen/include/xlat.lst
+++ b/xen/include/xlat.lst
@@ -162,6 +162,7 @@
 
 !	pct_register			platform.h
 !	power_register			platform.h
+?	processor_cppc			platform.h
 ?	processor_csd			platform.h
 !	processor_cx			platform.h
 !	processor_flags			platform.h
-- 
2.34.1




* [PATCH v3 03/15] xen/cpufreq: refactor cmdline "cpufreq=xxx"
  2025-03-06  8:39 [PATCH v3 00/15] amd-cppc CPU Performance Scaling Driver Penny Zheng
  2025-03-06  8:39 ` [PATCH v3 01/15] xen/cpufreq: introduces XEN_PM_PSD for solely delivery of _PSD Penny Zheng
  2025-03-06  8:39 ` [PATCH v3 02/15] xen/x86: introduce new sub-hypercall to propagate CPPC data Penny Zheng
@ 2025-03-06  8:39 ` Penny Zheng
  2025-03-24 15:00   ` Jan Beulich
  2025-03-06  8:39 ` [PATCH v3 04/15] xen/cpufreq: move XEN_PROCESSOR_PM_xxx to internal header Penny Zheng
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Penny Zheng @ 2025-03-06  8:39 UTC (permalink / raw)
  To: xen-devel
  Cc: ray.huang, Penny Zheng, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Jan Beulich, Julien Grall, Roger Pau Monné,
	Stefano Stabellini

This commit includes the following modifications:
- Introduce helper functions cpufreq_cmdline_parse_xen and
cpufreq_cmdline_parse_hwp to tidy up the different parsing paths
- Add helper cpufreq_opts_contain to ignore redundant user settings,
like "cpufreq=hwp;hwp;xen"
- Doc refinement
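
The redundancy check described above can be sketched in isolation. This is
only an illustration under stated assumptions: `enum opt`, `struct opt_list`,
`opts_contain()` and `opts_add()` are hypothetical stand-ins for Xen's
cpufreq_xen_opts[] / cpufreq_xen_cnt, and the sketch silently skips a repeated
option, which is not necessarily the return value the patch itself produces
for a duplicate.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for Xen's cpufreq_xen_opt enum and option array. */
enum opt { OPT_none, OPT_xen, OPT_hwp };

#define MAX_OPTS 2

struct opt_list {
    enum opt opts[MAX_OPTS];
    unsigned int cnt;
};

/* Return true if 'option' has already been recorded in the list. */
static bool opts_contain(const struct opt_list *l, enum opt option)
{
    unsigned int count = l->cnt;

    assert(count <= MAX_OPTS);

    while ( count )
        if ( l->opts[--count] == option )
            return true;

    return false;
}

/*
 * Record 'option' unless it is a repeat.
 * Returns 1 if added, 0 if ignored as redundant, -1 if the list is full.
 */
static int opts_add(struct opt_list *l, enum opt option)
{
    if ( opts_contain(l, option) )
        return 0;

    if ( l->cnt == MAX_OPTS )
        return -1;

    l->opts[l->cnt++] = option;

    return 1;
}
```

With this shape, "hwp;hwp;xen" records hwp once and xen once, so the repeated
entry does not consume a slot in the fixed-size array.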

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
---
v2 -> v3:
- new commit
---
 docs/misc/xen-command-line.pandoc |  3 +-
 xen/drivers/cpufreq/cpufreq.c     | 64 ++++++++++++++++++++++---------
 2 files changed, 48 insertions(+), 19 deletions(-)

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 0c6225391d..a440042471 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -535,7 +535,8 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
   processor to autonomously force physical package components into idle state.
   The default is enabled, but the option only applies when `hwp` is enabled.
 
-There is also support for `;`-separated fallback options:
+User could use `;`-separated options to support universal options which they
+would like to try on any agnostic platform, *but* under priority order, like
 `cpufreq=hwp;xen,verbose`.  This first tries `hwp` and falls back to `xen` if
 unavailable.  Note: The `verbose` suboption is handled globally.  Setting it
 for either the primary or fallback option applies to both irrespective of where
diff --git a/xen/drivers/cpufreq/cpufreq.c b/xen/drivers/cpufreq/cpufreq.c
index 894bafebaa..cfae16c15f 100644
--- a/xen/drivers/cpufreq/cpufreq.c
+++ b/xen/drivers/cpufreq/cpufreq.c
@@ -71,6 +71,46 @@ unsigned int __initdata cpufreq_xen_cnt = 1;
 
 static int __init cpufreq_cmdline_parse(const char *s, const char *e);
 
+static bool __init cpufreq_opts_contain(enum cpufreq_xen_opt option)
+{
+    unsigned int count = cpufreq_xen_cnt;
+
+    while ( count )
+    {
+        if ( cpufreq_xen_opts[--count] == option )
+            return true;
+    }
+
+    return false;
+}
+
+static int __init cpufreq_cmdline_parse_xen(const char *arg, const char *end)
+{
+    int ret = 0;
+
+    xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
+    cpufreq_controller = FREQCTL_xen;
+    cpufreq_xen_opts[cpufreq_xen_cnt++] = CPUFREQ_xen;
+    ret = 0;
+    if ( arg[0] && arg[1] )
+        ret = cpufreq_cmdline_parse(arg + 1, end);
+
+    return ret;
+}
+
+static int __init cpufreq_cmdline_parse_hwp(const char *arg, const char *end)
+{
+    int ret = 0;
+
+    xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
+    cpufreq_controller = FREQCTL_xen;
+    cpufreq_xen_opts[cpufreq_xen_cnt++] = CPUFREQ_hwp;
+    if ( arg[0] && arg[1] )
+        ret = hwp_cmdline_parse(arg + 1, end);
+
+    return ret;
+}
+
 static int __init cf_check setup_cpufreq_option(const char *str)
 {
     const char *arg = strpbrk(str, ",:;");
@@ -112,25 +152,13 @@ static int __init cf_check setup_cpufreq_option(const char *str)
         if ( cpufreq_xen_cnt == ARRAY_SIZE(cpufreq_xen_opts) )
             return -E2BIG;
 
-        if ( choice > 0 || !cmdline_strcmp(str, "xen") )
-        {
-            xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
-            cpufreq_controller = FREQCTL_xen;
-            cpufreq_xen_opts[cpufreq_xen_cnt++] = CPUFREQ_xen;
-            ret = 0;
-            if ( arg[0] && arg[1] )
-                ret = cpufreq_cmdline_parse(arg + 1, end);
-        }
+        if ( (choice > 0 || !cmdline_strcmp(str, "xen")) &&
+             !cpufreq_opts_contain(CPUFREQ_xen) )
+            ret = cpufreq_cmdline_parse_xen(arg, end);
         else if ( IS_ENABLED(CONFIG_INTEL) && choice < 0 &&
-                  !cmdline_strcmp(str, "hwp") )
-        {
-            xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
-            cpufreq_controller = FREQCTL_xen;
-            cpufreq_xen_opts[cpufreq_xen_cnt++] = CPUFREQ_hwp;
-            ret = 0;
-            if ( arg[0] && arg[1] )
-                ret = hwp_cmdline_parse(arg + 1, end);
-        }
+                  !cmdline_strcmp(str, "hwp") &&
+                  !cpufreq_opts_contain(CPUFREQ_hwp) )
+            ret = cpufreq_cmdline_parse_hwp(arg, end);
         else
             ret = -EINVAL;
 
-- 
2.34.1




* [PATCH v3 04/15] xen/cpufreq: move XEN_PROCESSOR_PM_xxx to internal header
  2025-03-06  8:39 [PATCH v3 00/15] amd-cppc CPU Performance Scaling Driver Penny Zheng
                   ` (2 preceding siblings ...)
  2025-03-06  8:39 ` [PATCH v3 03/15] xen/cpufreq: refactor cmdline "cpufreq=xxx" Penny Zheng
@ 2025-03-06  8:39 ` Penny Zheng
  2025-03-24 15:11   ` Jan Beulich
  2025-03-06  8:39 ` [PATCH v3 05/15] xen/x86: introduce "cpufreq=amd-cppc" xen cmdline Penny Zheng
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Penny Zheng @ 2025-03-06  8:39 UTC (permalink / raw)
  To: xen-devel
  Cc: ray.huang, Penny Zheng, Jan Beulich, Andrew Cooper,
	Roger Pau Monné, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini

XEN_PROCESSOR_PM_xxx are only used to set xen_processor_pmbits, which is a
Xen-internal variable. Although these bits were passed to PV Dom0 in
si->flags, they have never been used anywhere.
So this commit moves XEN_PROCESSOR_PM_xxx to the internal header
"acpi/cpufreq/processor_perf.h".

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
---
v2 -> v3:
- new commit
---
 xen/arch/x86/pv/dom0_build.c              | 1 -
 xen/arch/x86/setup.c                      | 1 +
 xen/common/domain.c                       | 1 +
 xen/include/acpi/cpufreq/processor_perf.h | 5 +++++
 xen/include/public/platform.h             | 5 -----
 xen/include/public/xen.h                  | 1 -
 6 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 96e28c7b6a..a62948b0e8 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -886,7 +886,6 @@ static int __init dom0_construct(struct boot_info *bi, struct domain *d)
         si->flags    = SIF_PRIVILEGED | SIF_INITDOMAIN;
     if ( !vinitrd_start && initrd_len )
         si->flags   |= SIF_MOD_START_PFN;
-    si->flags       |= MASK_INSR(xen_processor_pmbits, SIF_PM_MASK);
     si->pt_base      = vpt_start;
     si->nr_pt_frames = nr_pt_pages;
     si->mfn_list     = vphysmap_start;
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 8ebe5a9443..5101b381fe 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -62,6 +62,7 @@
 #include <asm/prot-key.h>
 #include <asm/pv/domain.h>
 #include <asm/trampoline.h>
+#include <acpi/cpufreq/cpufreq.h>
 
 /* opt_nosmp: If true, secondary processors are ignored. */
 static bool __initdata opt_nosmp;
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 0c4cc77111..05cfa1d885 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -43,6 +43,7 @@
 #include <xsm/xsm.h>
 #include <xen/trace.h>
 #include <asm/setup.h>
+#include <acpi/cpufreq/cpufreq.h>
 
 #ifdef CONFIG_X86
 #include <asm/guest.h>
diff --git a/xen/include/acpi/cpufreq/processor_perf.h b/xen/include/acpi/cpufreq/processor_perf.h
index 12b6e6b826..33edf112a0 100644
--- a/xen/include/acpi/cpufreq/processor_perf.h
+++ b/xen/include/acpi/cpufreq/processor_perf.h
@@ -5,6 +5,11 @@
 #include <public/sysctl.h>
 #include <xen/acpi.h>
 
+/* ability bits */
+#define XEN_PROCESSOR_PM_CX 1
+#define XEN_PROCESSOR_PM_PX 2
+#define XEN_PROCESSOR_PM_TX 4
+
 #define XEN_CPPC_INIT 0x40000000U
 #define XEN_PX_INIT   0x80000000U
 
diff --git a/xen/include/public/platform.h b/xen/include/public/platform.h
index 07f4b72014..24cc5812ed 100644
--- a/xen/include/public/platform.h
+++ b/xen/include/public/platform.h
@@ -353,11 +353,6 @@ DEFINE_XEN_GUEST_HANDLE(xenpf_getidletime_t);
 
 #define XENPF_set_processor_pminfo      54
 
-/* ability bits */
-#define XEN_PROCESSOR_PM_CX	1
-#define XEN_PROCESSOR_PM_PX	2
-#define XEN_PROCESSOR_PM_TX	4
-
 /* cmd type */
 #define XEN_PM_CX   0
 #define XEN_PM_PX   1
diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
index e051f989a5..941d288ec1 100644
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -877,7 +877,6 @@ typedef struct start_info start_info_t;
 #define SIF_MOD_START_PFN (1<<3)  /* Is mod_start a PFN? */
 #define SIF_VIRT_P2M_4TOOLS (1<<4) /* Do Xen tools understand a virt. mapped */
                                    /* P->M making the 3 level tree obsolete? */
-#define SIF_PM_MASK       (0xFF<<8) /* reserve 1 byte for xen-pm options */
 
 /*
  * A multiboot module is a package containing modules very similar to a
-- 
2.34.1




* [PATCH v3 05/15] xen/x86: introduce "cpufreq=amd-cppc" xen cmdline
  2025-03-06  8:39 [PATCH v3 00/15] amd-cppc CPU Performance Scaling Driver Penny Zheng
                   ` (3 preceding siblings ...)
  2025-03-06  8:39 ` [PATCH v3 04/15] xen/cpufreq: move XEN_PROCESSOR_PM_xxx to internal header Penny Zheng
@ 2025-03-06  8:39 ` Penny Zheng
  2025-03-24 15:26   ` Jan Beulich
  2025-03-25 10:00   ` Jan Beulich
  2025-03-06  8:39 ` [PATCH v3 06/15] xen/cpufreq: disable px statistic info in amd-cppc mode Penny Zheng
                   ` (9 subsequent siblings)
  14 siblings, 2 replies; 60+ messages in thread
From: Penny Zheng @ 2025-03-06  8:39 UTC (permalink / raw)
  To: xen-devel
  Cc: ray.huang, Penny Zheng, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Jan Beulich, Julien Grall, Roger Pau Monné,
	Stefano Stabellini

Users need to set "cpufreq=amd-cppc" on the Xen command line to enable the
amd-cppc driver, which selects ACPI Collaborative Processor Performance
Control (CPPC) on supported AMD hardware to provide a finer grained
frequency control mechanism.
The `verbose` option can also be included to enable verbose output.

When users set "cpufreq=amd-cppc", a new amd-cppc driver is registered
and used. The actual implementation will be introduced in the following
commits.

Xen is not expected to support both or mixed mode (CPPC & legacy _PSS, _PCT,
_PPC) operations. Only one cpufreq driver gets registered, either amd-cppc or
the legacy P-states driver, which is reflected and asserted by the mutually
exclusive flags XEN_PROCESSOR_PM_PX and XEN_PROCESSOR_PM_CPPC.
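
The fallback order and the flag exclusivity described above can be sketched as
a standalone model. This is a simplified illustration, not the driver code:
`struct state`, `register_cppc()`, `register_legacy()` and `select_driver()`
are hypothetical names, and only the two bit values (2 for PX, 8 for CPPC)
are taken from this series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define PM_PX   2u   /* mirrors XEN_PROCESSOR_PM_PX */
#define PM_CPPC 8u   /* mirrors XEN_PROCESSOR_PM_CPPC */

enum opt { OPT_xen, OPT_amd_cppc };

/* Hypothetical model of the global state touched during registration. */
struct state {
    unsigned int pmbits;
    bool has_cppc;      /* stands in for the hardware capability check */
};

static int register_cppc(struct state *s)
{
    if ( !s->has_cppc )
    {
        /* No CPPC hardware: drop the flag so a fallback can take over. */
        s->pmbits &= ~PM_CPPC;
        return -ENODEV;
    }

    /* CPPC won: remove the possible legacy fallback flag. */
    s->pmbits &= ~PM_PX;
    return 0;
}

static int register_legacy(struct state *s)
{
    /* Legacy P-states won: CPPC must not stay set. */
    s->pmbits &= ~PM_CPPC;
    return 0;
}

/* Try each option in priority order; only -ENODEV moves on to the next. */
static int select_driver(struct state *s, const enum opt *opts,
                         unsigned int cnt)
{
    int ret = -ENOENT;

    for ( unsigned int i = 0; i < cnt; i++ )
    {
        switch ( opts[i] )
        {
        case OPT_amd_cppc:
            ret = register_cppc(s);
            break;
        case OPT_xen:
            ret = register_legacy(s);
            break;
        }

        if ( ret != -ENODEV )
            break;
    }

    /* After a successful registration the two flags are never both set. */
    if ( !ret )
        assert(!((s->pmbits & PM_PX) && (s->pmbits & PM_CPPC)));

    return ret;
}
```

In this model, "amd-cppc;xen" on hardware without CPPC ends with only the PX
bit set, while the same option string on CPPC-capable hardware ends with only
the CPPC bit set.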

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
---
v1 -> v2:
- Obey alphabetic sorting and also restrict it to CONFIG_AMD
- Remove unnecessary empty comment line
- Use __initconst_cf_clobber for pre-filled structure cpufreq_driver
- Make new switch-case code apply to Hygon CPUs too
- Replace ENOSYS with EOPNOTSUPP
- Blanks around binary operator
- Change all amd_/-pstate defined values to amd_/-cppc
---
v2 -> v3:
- Refactor overly long lines
- Make sure XEN_PROCESSOR_PM_PX and XEN_PROCESSOR_PM_CPPC are mutually
exclusive after cpufreq driver registration
---
 docs/misc/xen-command-line.pandoc         | 16 +++--
 xen/arch/x86/acpi/cpufreq/Makefile        |  1 +
 xen/arch/x86/acpi/cpufreq/acpi.c          | 12 +++-
 xen/arch/x86/acpi/cpufreq/amd-cppc.c      | 78 +++++++++++++++++++++++
 xen/arch/x86/acpi/cpufreq/cpufreq.c       | 34 +++++++++-
 xen/arch/x86/platform_hypercall.c         | 11 +++-
 xen/drivers/cpufreq/cpufreq.c             | 17 +++++
 xen/include/acpi/cpufreq/cpufreq.h        |  4 ++
 xen/include/acpi/cpufreq/processor_perf.h |  7 +-
 xen/include/public/sysctl.h               |  1 +
 10 files changed, 169 insertions(+), 12 deletions(-)
 create mode 100644 xen/arch/x86/acpi/cpufreq/amd-cppc.c

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index a440042471..b3c3ca2377 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -515,7 +515,7 @@ If set, force use of the performance counters for oprofile, rather than detectin
 available support.
 
 ### cpufreq
-> `= none | {{ <boolean> | xen } { [:[powersave|performance|ondemand|userspace][,[<maxfreq>]][,[<minfreq>]]] } [,verbose]} | dom0-kernel | hwp[:[<hdc>][,verbose]]`
+> `= none | {{ <boolean> | xen } { [:[powersave|performance|ondemand|userspace][,[<maxfreq>]][,[<minfreq>]]] } [,verbose]} | dom0-kernel | hwp[:[<hdc>][,verbose]] | amd-cppc[:[verbose]]`
 
 > Default: `xen`
 
@@ -526,7 +526,7 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
 * `<maxfreq>` and `<minfreq>` are integers which represent max and min processor frequencies
   respectively.
 * `verbose` option can be included as a string or also as `verbose=<integer>`
-  for `xen`.  It is a boolean for `hwp`.
+  for `xen`.  It is a boolean for `hwp` and `amd-cppc`.
 * `hwp` selects Hardware-Controlled Performance States (HWP) on supported Intel
   hardware.  HWP is a Skylake+ feature which provides better CPU power
   management.  The default is disabled.  If `hwp` is selected, but hardware
@@ -534,13 +534,17 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
 * `<hdc>` is a boolean to enable Hardware Duty Cycling (HDC).  HDC enables the
   processor to autonomously force physical package components into idle state.
   The default is enabled, but the option only applies when `hwp` is enabled.
+* `amd-cppc` selects ACPI Collaborative Performance and Power Control (CPPC)
+  on supported AMD hardware to provide finer grained frequency control
+  mechanism. The default is disabled.
 
 User could use `;`-separated options to support universal options which they
 would like to try on any agnostic platform, *but* under priority order, like
-`cpufreq=hwp;xen,verbose`.  This first tries `hwp` and falls back to `xen` if
-unavailable.  Note: The `verbose` suboption is handled globally.  Setting it
-for either the primary or fallback option applies to both irrespective of where
-it is specified.
+`cpufreq=hwp;amd-cppc;xen,verbose`. This first tries `hwp` on Intel, or
+`amd-cppc` on AMD, and it will fall back to `xen` if unavailable. Note:
+The `verbose` suboption is handled globally.  Setting it for either the
+primary or fallback option applies to both irrespective of where it is
+specified.
 
 Note: grub2 requires to escape or quote ';', so `"cpufreq=hwp;xen"` should be
 specified within double quotes inside grub.cfg.  Refer to the grub2
diff --git a/xen/arch/x86/acpi/cpufreq/Makefile b/xen/arch/x86/acpi/cpufreq/Makefile
index e7dbe434a8..a2ba34bda0 100644
--- a/xen/arch/x86/acpi/cpufreq/Makefile
+++ b/xen/arch/x86/acpi/cpufreq/Makefile
@@ -1,4 +1,5 @@
 obj-$(CONFIG_INTEL) += acpi.o
+obj-$(CONFIG_AMD) += amd-cppc.o
 obj-y += cpufreq.o
 obj-$(CONFIG_INTEL) += hwp.o
 obj-$(CONFIG_AMD) += powernow.o
diff --git a/xen/arch/x86/acpi/cpufreq/acpi.c b/xen/arch/x86/acpi/cpufreq/acpi.c
index 0cf94ab2d6..49c1795b25 100644
--- a/xen/arch/x86/acpi/cpufreq/acpi.c
+++ b/xen/arch/x86/acpi/cpufreq/acpi.c
@@ -13,6 +13,7 @@
 
 #include <xen/errno.h>
 #include <xen/delay.h>
+#include <xen/domain.h>
 #include <xen/param.h>
 #include <xen/types.h>
 
@@ -514,5 +515,14 @@ acpi_cpufreq_driver = {
 
 int __init acpi_cpufreq_register(void)
 {
-    return cpufreq_register_driver(&acpi_cpufreq_driver);
+    int ret;
+
+    ret = cpufreq_register_driver(&acpi_cpufreq_driver);
+    if ( ret )
+        return ret;
+
+    if ( IS_ENABLED(CONFIG_AMD) )
+        xen_processor_pmbits &= ~XEN_PROCESSOR_PM_CPPC;
+
+    return ret;
 }
diff --git a/xen/arch/x86/acpi/cpufreq/amd-cppc.c b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
new file mode 100644
index 0000000000..7d482140a2
--- /dev/null
+++ b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
@@ -0,0 +1,78 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * amd-cppc.c - AMD Processor CPPC Frequency Driver
+ *
+ * Copyright (C) 2025 Advanced Micro Devices, Inc. All Rights Reserved.
+ *
+ * Author: Penny Zheng <penny.zheng@amd.com>
+ *
+ * AMD CPPC cpufreq driver introduces a new CPU performance scaling design
+ * for AMD processors using the ACPI Collaborative Performance and Power
+ * Control (CPPC) feature which provides finer grained frequency control range.
+ */
+
+#include <xen/domain.h>
+#include <xen/init.h>
+#include <xen/param.h>
+#include <acpi/cpufreq/cpufreq.h>
+
+static bool __init amd_cppc_handle_option(const char *s, const char *end)
+{
+    int ret;
+
+    ret = parse_boolean("verbose", s, end);
+    if ( ret >= 0 )
+    {
+        cpufreq_verbose = ret;
+        return true;
+    }
+
+    return false;
+}
+
+int __init amd_cppc_cmdline_parse(const char *s, const char *e)
+{
+    do
+    {
+        const char *end = strpbrk(s, ",;");
+
+        if ( !amd_cppc_handle_option(s, end) )
+        {
+            printk(XENLOG_WARNING
+                   "cpufreq/amd-cppc: option '%.*s' not recognized\n",
+                   (int)((end ?: e) - s), s);
+
+            return -EINVAL;
+        }
+
+        s = end ? end + 1 : NULL;
+    } while ( s && s < e );
+
+    return 0;
+}
+
+static const struct cpufreq_driver __initconst_cf_clobber
+amd_cppc_cpufreq_driver =
+{
+    .name   = XEN_AMD_CPPC_DRIVER_NAME,
+};
+
+int __init amd_cppc_register_driver(void)
+{
+    int ret;
+
+    if ( !cpu_has_cppc )
+    {
+        xen_processor_pmbits &= ~XEN_PROCESSOR_PM_CPPC;
+        return -ENODEV;
+    }
+
+    ret = cpufreq_register_driver(&amd_cppc_cpufreq_driver);
+    if ( ret )
+        return ret;
+
+    /* Remove possible fallback option */
+    xen_processor_pmbits &= ~XEN_PROCESSOR_PM_PX;
+
+    return ret;
+}
diff --git a/xen/arch/x86/acpi/cpufreq/cpufreq.c b/xen/arch/x86/acpi/cpufreq/cpufreq.c
index 61e98b67bd..690a285f11 100644
--- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
+++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
@@ -148,6 +148,10 @@ static int __init cf_check cpufreq_driver_init(void)
                 case CPUFREQ_none:
                     ret = 0;
                     break;
+                default:
+                    printk(XENLOG_WARNING
+                           "Unsupported cpufreq driver for vendor Intel\n");
+                    break;
                 }
 
                 if ( ret != -ENODEV )
@@ -157,7 +161,35 @@ static int __init cf_check cpufreq_driver_init(void)
 
         case X86_VENDOR_AMD:
         case X86_VENDOR_HYGON:
-            ret = IS_ENABLED(CONFIG_AMD) ? powernow_register_driver() : -ENODEV;
+            if ( !IS_ENABLED(CONFIG_AMD) )
+            {
+                ret = -ENODEV;
+                break;
+            }
+            ret = -ENOENT;
+
+            for ( unsigned int i = 0; i < cpufreq_xen_cnt; i++ )
+            {
+                switch ( cpufreq_xen_opts[i] )
+                {
+                case CPUFREQ_xen:
+                    ret = powernow_register_driver();
+                    break;
+                case CPUFREQ_amd_cppc:
+                    ret = amd_cppc_register_driver();
+                    break;
+                case CPUFREQ_none:
+                    ret = 0;
+                    break;
+                default:
+                    printk(XENLOG_WARNING
+                           "Unsupported cpufreq driver for vendor AMD\n");
+                    break;
+                }
+
+                if ( ret != -ENODEV )
+                    break;
+            }
             break;
         }
     }
diff --git a/xen/arch/x86/platform_hypercall.c b/xen/arch/x86/platform_hypercall.c
index 77390a0dbd..5dd1ba2949 100644
--- a/xen/arch/x86/platform_hypercall.c
+++ b/xen/arch/x86/platform_hypercall.c
@@ -542,6 +542,7 @@ ret_t do_platform_op(
                 ret = -ENOSYS;
                 break;
             }
+            ASSERT(!(xen_processor_pmbits & XEN_PROCESSOR_PM_CPPC));
             ret = set_px_pminfo(op->u.set_pminfo.id, &op->u.set_pminfo.u.perf);
             break;
  
@@ -572,7 +573,8 @@ ret_t do_platform_op(
             break;
         }
         case XEN_PM_PSD:
-            if ( !(xen_processor_pmbits & XEN_PROCESSOR_PM_PX) )
+            if ( !(xen_processor_pmbits & (XEN_PROCESSOR_PM_PX |
+                                           XEN_PROCESSOR_PM_CPPC)) )
             {
                 ret = -EOPNOTSUPP;
                 break;
@@ -584,6 +586,13 @@ ret_t do_platform_op(
             break;
 
         case XEN_PM_CPPC:
+            if ( !(xen_processor_pmbits & XEN_PROCESSOR_PM_CPPC) )
+            {
+                ret = -EOPNOTSUPP;
+                break;
+            }
+            ASSERT(!(xen_processor_pmbits & XEN_PROCESSOR_PM_PX));
+
             ret = set_cppc_pminfo(op->u.set_pminfo.id,
                                   &op->u.set_pminfo.u.cppc_data);
             break;
diff --git a/xen/drivers/cpufreq/cpufreq.c b/xen/drivers/cpufreq/cpufreq.c
index cfae16c15f..792e4dc02c 100644
--- a/xen/drivers/cpufreq/cpufreq.c
+++ b/xen/drivers/cpufreq/cpufreq.c
@@ -111,6 +111,19 @@ static int __init cpufreq_cmdline_parse_hwp(const char *arg, const char *end)
     return ret;
 }
 
+static int __init cpufreq_cmdline_parse_cppc(const char *arg, const char *end)
+{
+    int ret = 0;
+
+    xen_processor_pmbits |= XEN_PROCESSOR_PM_CPPC;
+    cpufreq_controller = FREQCTL_xen;
+    cpufreq_xen_opts[cpufreq_xen_cnt++] = CPUFREQ_amd_cppc;
+    if ( arg[0] && arg[1] )
+        ret = amd_cppc_cmdline_parse(arg + 1, end);
+
+    return ret;
+}
+
 static int __init cf_check setup_cpufreq_option(const char *str)
 {
     const char *arg = strpbrk(str, ",:;");
@@ -159,6 +172,10 @@ static int __init cf_check setup_cpufreq_option(const char *str)
                   !cmdline_strcmp(str, "hwp") &&
                   !cpufreq_opts_contain(CPUFREQ_hwp) )
             ret = cpufreq_cmdline_parse_hwp(arg, end);
+        else if ( IS_ENABLED(CONFIG_AMD) && choice < 0 &&
+                  !cmdline_strcmp(str, "amd-cppc") &&
+                  !cpufreq_opts_contain(CPUFREQ_amd_cppc) )
+            ret = cpufreq_cmdline_parse_cppc(arg, end);
         else
             ret = -EINVAL;
 
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 3f1b05a02e..a6fb10ea27 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -28,6 +28,7 @@ enum cpufreq_xen_opt {
     CPUFREQ_none,
     CPUFREQ_xen,
     CPUFREQ_hwp,
+    CPUFREQ_amd_cppc,
 };
 extern enum cpufreq_xen_opt cpufreq_xen_opts[2];
 extern unsigned int cpufreq_xen_cnt;
@@ -267,4 +268,7 @@ int set_hwp_para(struct cpufreq_policy *policy,
 
 int acpi_cpufreq_register(void);
 
+int amd_cppc_cmdline_parse(const char *s, const char *e);
+int amd_cppc_register_driver(void);
+
 #endif /* __XEN_CPUFREQ_PM_H__ */
diff --git a/xen/include/acpi/cpufreq/processor_perf.h b/xen/include/acpi/cpufreq/processor_perf.h
index 33edf112a0..ee12e0192b 100644
--- a/xen/include/acpi/cpufreq/processor_perf.h
+++ b/xen/include/acpi/cpufreq/processor_perf.h
@@ -6,9 +6,10 @@
 #include <xen/acpi.h>
 
 /* ability bits */
-#define XEN_PROCESSOR_PM_CX 1
-#define XEN_PROCESSOR_PM_PX 2
-#define XEN_PROCESSOR_PM_TX 4
+#define XEN_PROCESSOR_PM_CX     1
+#define XEN_PROCESSOR_PM_PX     2
+#define XEN_PROCESSOR_PM_TX     4
+#define XEN_PROCESSOR_PM_CPPC   8
 
 #define XEN_CPPC_INIT 0x40000000U
 #define XEN_PX_INIT   0x80000000U
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index b0fec271d3..42997252ef 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -423,6 +423,7 @@ struct xen_set_cppc_para {
     uint32_t activity_window;
 };
 
+#define XEN_AMD_CPPC_DRIVER_NAME "amd-cppc"
 #define XEN_HWP_DRIVER_NAME "hwp"
 
 /*
-- 
2.34.1




* [PATCH v3 06/15] xen/cpufreq: disable px statistic info in amd-cppc mode
  2025-03-06  8:39 [PATCH v3 00/15] amd-cppc CPU Performance Scaling Driver Penny Zheng
                   ` (4 preceding siblings ...)
  2025-03-06  8:39 ` [PATCH v3 05/15] xen/x86: introduce "cpufreq=amd-cppc" xen cmdline Penny Zheng
@ 2025-03-06  8:39 ` Penny Zheng
  2025-03-24 15:34   ` Jan Beulich
  2025-03-06  8:39 ` [PATCH v3 07/15] xen/cpufreq: fix core frequency calculation for AMD Family 1Ah CPUs Penny Zheng
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Penny Zheng @ 2025-03-06  8:39 UTC (permalink / raw)
  To: xen-devel; +Cc: ray.huang, Penny Zheng, Jan Beulich

Bypass construction and destruction of the Px statistic info
(cpufreq_statistic_init() and cpufreq_statistic_exit()) in cpufreq
CPPC mode.

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
---
v2 -> v3:
- new commit
---
 xen/drivers/cpufreq/utility.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/xen/drivers/cpufreq/utility.c b/xen/drivers/cpufreq/utility.c
index e690a484f1..f1fd2fdbce 100644
--- a/xen/drivers/cpufreq/utility.c
+++ b/xen/drivers/cpufreq/utility.c
@@ -98,6 +98,9 @@ int cpufreq_statistic_init(unsigned int cpu)
     if ( !pmpt )
         return -EINVAL;
 
+    if ( !(pmpt->init & XEN_PX_INIT) )
+        return 0;
+
     spin_lock(cpufreq_statistic_lock);
 
     pxpt = per_cpu(cpufreq_statistic_data, cpu);
@@ -147,8 +150,12 @@ int cpufreq_statistic_init(unsigned int cpu)
 void cpufreq_statistic_exit(unsigned int cpu)
 {
     struct pm_px *pxpt;
+    const struct processor_pminfo *pmpt = processor_pminfo[cpu];
     spinlock_t *cpufreq_statistic_lock = &per_cpu(cpufreq_statistic_lock, cpu);
 
+    if ( !(pmpt->init & XEN_PX_INIT) )
+        return;
+
     spin_lock(cpufreq_statistic_lock);
 
     pxpt = per_cpu(cpufreq_statistic_data, cpu);
-- 
2.34.1




* [PATCH v3 07/15] xen/cpufreq: fix core frequency calculation for AMD Family 1Ah CPUs
  2025-03-06  8:39 [PATCH v3 00/15] amd-cppc CPU Performance Scaling Driver Penny Zheng
                   ` (5 preceding siblings ...)
  2025-03-06  8:39 ` [PATCH v3 06/15] xen/cpufreq: disable px statistic info in amd-cppc mode Penny Zheng
@ 2025-03-06  8:39 ` Penny Zheng
  2025-03-24 15:47   ` Jan Beulich
  2025-03-06  8:39 ` [PATCH v3 08/15] xen/amd: export processor max frequency value Penny Zheng
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Penny Zheng @ 2025-03-06  8:39 UTC (permalink / raw)
  To: xen-devel
  Cc: ray.huang, Penny Zheng, Jan Beulich, Andrew Cooper,
	Roger Pau Monné

This commit fixes core frequency calculation for AMD Family 1Ah CPUs, due to
a change in the PStateDef MSR layout in AMD Family 1Ah+.
In AMD Family 1Ah+, Core current operating frequency in MHz is calculated as
follows:
CoreCOF = Core::X86::Msr::PStateDef[CpuFid[11:0]] * 5MHz
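
As a worked example of the three layouts, the sketch below decodes a raw
PStateDef value per family. The bit masks follow the ones used in this patch;
the function name `pstate_to_mhz()` is hypothetical, and the zero-divisor
guard in the Family 17h-19h branch is an added assumption that is not part of
the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a raw PStateDef MSR value into MHz for the given CPU family. */
static uint64_t pstate_to_mhz(unsigned int family, uint64_t value)
{
    assert(family <= 0x1A);

    /* Before Family 17h: 6-bit FID plus 0x10, times 100, shifted by DID. */
    if ( family < 0x17 )
        return (((value & 0x3f) + 0x10) * 100) >> ((value >> 6) & 7);

    /* Family 17h-19h: 8-bit FID times 25 MHz, scaled by 8 / DID. */
    if ( family <= 0x19 )
    {
        uint64_t did = (value >> 8) & 0x3f;

        /* Assumption: report 0 rather than divide by a zero DID. */
        return did ? ((value & 0xff) * 25 * 8) / did : 0;
    }

    /* Family 1Ah: 12-bit CpuFid in units of 5 MHz. */
    return (value & 0xfff) * 5;
}
```

For instance, on Family 1Ah a CpuFid of 0x320 (800) gives 800 * 5 = 4000 MHz,
while on Family 17h a FID of 0x88 with a DID of 8 gives 136 * 200 / 8 =
3400 MHz.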

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
---
v2 -> v3:
- new commit
---
 xen/arch/x86/cpu/amd.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/cpu/amd.c b/xen/arch/x86/cpu/amd.c
index 597b0f073d..7fb1d76798 100644
--- a/xen/arch/x86/cpu/amd.c
+++ b/xen/arch/x86/cpu/amd.c
@@ -572,12 +572,24 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
                                                           : c->cpu_core_id);
 }
 
+static uint64_t amd_parse_freq(const struct cpuinfo_x86 *c, uint64_t value)
+{
+	ASSERT(c->x86 <= 0x1A);
+
+	if (c->x86 < 0x17)
+		return (((value & 0x3f) + 0x10) * 100) >> ((value >> 6) & 7);
+	else if (c->x86 <= 0x19)
+		return ((value & 0xff) * 25 * 8) / ((value >> 8) & 0x3f);
+	else
+		return (value & 0xfff) * 5;
+}
+
 void amd_log_freq(const struct cpuinfo_x86 *c)
 {
 	unsigned int idx = 0, h;
 	uint64_t hi, lo, val;
 
-	if (c->x86 < 0x10 || c->x86 > 0x19 ||
+	if (c->x86 < 0x10 || c->x86 > 0x1A ||
 	    (c != &boot_cpu_data &&
 	     (!opt_cpu_info || (c->apicid & (c->x86_num_siblings - 1)))))
 		return;
@@ -658,19 +670,20 @@ void amd_log_freq(const struct cpuinfo_x86 *c)
 	if (!(lo >> 63))
 		return;
 
-#define FREQ(v) (c->x86 < 0x17 ? ((((v) & 0x3f) + 0x10) * 100) >> (((v) >> 6) & 7) \
-		                     : (((v) & 0xff) * 25 * 8) / (((v) >> 8) & 0x3f))
 	if (idx && idx < h &&
 	    !rdmsr_safe(0xC0010064 + idx, val) && (val >> 63) &&
 	    !rdmsr_safe(0xC0010064, hi) && (hi >> 63))
 		printk("CPU%u: %lu (%lu ... %lu) MHz\n",
-		       smp_processor_id(), FREQ(val), FREQ(lo), FREQ(hi));
+		       smp_processor_id(),
+		       amd_parse_freq(c, val),
+		       amd_parse_freq(c, lo), amd_parse_freq(c, hi));
 	else if (h && !rdmsr_safe(0xC0010064, hi) && (hi >> 63))
 		printk("CPU%u: %lu ... %lu MHz\n",
-		       smp_processor_id(), FREQ(lo), FREQ(hi));
+		       smp_processor_id(),
+		       amd_parse_freq(c, lo), amd_parse_freq(c, hi));
 	else
-		printk("CPU%u: %lu MHz\n", smp_processor_id(), FREQ(lo));
-#undef FREQ
+		printk("CPU%u: %lu MHz\n", smp_processor_id(),
+		       amd_parse_freq(c, lo));
 }
 
 void cf_check early_init_amd(struct cpuinfo_x86 *c)
-- 
2.34.1




* [PATCH v3 08/15] xen/amd: export processor max frequency value
  2025-03-06  8:39 [PATCH v3 00/15] amd-cppc CPU Performance Scaling Driver Penny Zheng
                   ` (6 preceding siblings ...)
  2025-03-06  8:39 ` [PATCH v3 07/15] xen/cpufreq: fix core frequency calculation for AMD Family 1Ah CPUs Penny Zheng
@ 2025-03-06  8:39 ` Penny Zheng
  2025-03-24 15:52   ` Jan Beulich
  2025-03-06  8:39 ` [PATCH v3 09/15] xen/x86: introduce a new amd cppc driver for cpufreq scaling Penny Zheng
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Penny Zheng @ 2025-03-06  8:39 UTC (permalink / raw)
  To: xen-devel
  Cc: ray.huang, Penny Zheng, Jan Beulich, Andrew Cooper,
	Roger Pau Monné

When the _CPC table cannot provide processor frequency range
values for the Xen governor, we need to read the processor max
frequency as an anchor point.

For AMD processors, we export the max frequency value from amd_log_freq().
Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
---
v1 -> v2:
- new commit
---
v2 -> v3:
- Replace per_cpu with this_cpu
- Add amd_ prefix for AMD-only variable
---
 xen/arch/x86/cpu/amd.c         | 10 +++++++++-
 xen/arch/x86/include/asm/amd.h |  1 +
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/cpu/amd.c b/xen/arch/x86/cpu/amd.c
index 7fb1d76798..6ab1cff3e5 100644
--- a/xen/arch/x86/cpu/amd.c
+++ b/xen/arch/x86/cpu/amd.c
@@ -56,6 +56,8 @@ bool __initdata amd_virt_spec_ctrl;
 
 static bool __read_mostly fam17_c6_disabled;
 
+DEFINE_PER_CPU_READ_MOSTLY(uint64_t, amd_max_freq_mhz);
+
 static inline int rdmsr_amd_safe(unsigned int msr, unsigned int *lo,
 				 unsigned int *hi)
 {
@@ -681,9 +683,15 @@ void amd_log_freq(const struct cpuinfo_x86 *c)
 		printk("CPU%u: %lu ... %lu MHz\n",
 		       smp_processor_id(),
 		       amd_parse_freq(c, lo), amd_parse_freq(c, hi));
-	else
+	else {
 		printk("CPU%u: %lu MHz\n", smp_processor_id(),
 		       amd_parse_freq(c, lo));
+		return;
+	}
+
+	/* Store max frequency for amd-cppc cpufreq driver */
+	if (hi >> 63)
+		this_cpu(amd_max_freq_mhz) = amd_parse_freq(c, hi);
 }
 
 void cf_check early_init_amd(struct cpuinfo_x86 *c)
diff --git a/xen/arch/x86/include/asm/amd.h b/xen/arch/x86/include/asm/amd.h
index 9c9599a622..cf9177c00a 100644
--- a/xen/arch/x86/include/asm/amd.h
+++ b/xen/arch/x86/include/asm/amd.h
@@ -174,4 +174,5 @@ bool amd_setup_legacy_ssbd(void);
 void amd_set_legacy_ssbd(bool enable);
 void amd_set_cpuid_user_dis(bool enable);
 
+DECLARE_PER_CPU(uint64_t, amd_max_freq_mhz);
 #endif /* __AMD_H__ */
-- 
2.34.1




* [PATCH v3 09/15] xen/x86: introduce a new amd cppc driver for cpufreq scaling
  2025-03-06  8:39 [PATCH v3 00/15] amd-cppc CPU Performance Scaling Driver Penny Zheng
                   ` (7 preceding siblings ...)
  2025-03-06  8:39 ` [PATCH v3 08/15] xen/amd: export processor max frequency value Penny Zheng
@ 2025-03-06  8:39 ` Penny Zheng
  2025-03-25  9:57   ` Jan Beulich
  2025-03-06  8:39 ` [PATCH v3 10/15] xen/cpufreq: only set gov NULL when cpufreq_driver.setpolicy is NULL Penny Zheng
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Penny Zheng @ 2025-03-06  8:39 UTC (permalink / raw)
  To: xen-devel
  Cc: ray.huang, Penny Zheng, Jan Beulich, Andrew Cooper,
	Roger Pau Monné

amd-cppc is the AMD CPU performance scaling driver, which introduces a
new CPU frequency control mechanism, first available on AMD Zen-based
CPU series.
The new mechanism is based on Collaborative Processor Performance
Control (CPPC), which offers finer-grained frequency management
than legacy ACPI hardware P-states.
Current AMD CPU platforms use the ACPI P-states driver to
manage CPU frequency and clocks, switching among only 3 P-states.
The new amd-cppc driver gives Xen a more flexible, low-latency interface
to communicate performance hints directly to hardware.

This first version of "amd-cppc" leverages common governors such as
*ondemand*, *performance*, etc., to manage the performance hints. In the
future, we will introduce an advanced active mode to enable autonomous
performance level selection.
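
To make the governor-to-hardware path concrete, here is a minimal
standalone sketch of the linear frequency-to-perf mapping that passive
mode relies on: a line through (lowest perf, lowest freq) and
(nominal perf, nominal freq). The function name khz_to_perf() and its
flat parameter list are assumptions made for this example; the driver
itself works on its per-CPU data structure.

```c
#include <stdint.h>

/*
 * Hypothetical standalone version of the linear mapping: convert a target
 * frequency in kHz to a CPPC perf level, clamped to the 8-bit range.
 */
static uint8_t khz_to_perf(unsigned int freq_khz,
                           unsigned int lowest_mhz, unsigned int nominal_mhz,
                           uint8_t lowest_perf, uint8_t nominal_perf)
{
    int64_t mul = (int64_t)nominal_perf - lowest_perf;
    int64_t div = (int64_t)nominal_mhz - lowest_mhz;
    int64_t offset, res;

    /* Avoid dividing by zero when both frequencies are equal. */
    if ( div == 0 )
        div = 1;

    /* The frequency unit cancels out, so MHz can be used for the offset. */
    offset = nominal_perf - (mul * nominal_mhz) / div;
    res = offset + (mul * freq_khz) / (div * 1000);

    if ( res > UINT8_MAX )
        return UINT8_MAX;
    if ( res < 0 )
        return 0;

    return (uint8_t)res;
}
```

For example, with lowest = (perf 20, 400 MHz) and nominal = (perf 100,
2000 MHz), a governor target of 1200000 kHz maps to perf 60.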

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
---
v1 -> v2:
- re-construct union caps and req to have anonymous struct instead
- avoid "else" when the earlier if() ends in an unconditional control flow statement
- Add check to avoid chopping off set bits from cast
- make pointers pointer-to-const wherever possible
- remove noisy log
- exclude families before 0x17 before CPPC-feature MSR op
- remove useless variable helpers
- use xvzalloc and XVFREE
- refactor error handling as ENABLE bit can only be cleared by reset
---
v2 -> v3:
- Move all MSR definitions to msr-index.h and follow the required style
- Refactor opening figure braces for struct/union
- Sort overlong lines throughout the series
- Make offset/res int covering underflow scenario
- Error out when amd_max_freq_mhz isn't set
- Introduce amd_get_freq(name) macro to decrease redundancy
- Supported CPU family checked ahead of smp-function
- Nominal freq shall be checked between the [min, max]
- Use APERF/MPERF to calculate current frequency
- Use amd_cppc_cpufreq_cpu_exit() to tidy error path
---
 xen/arch/x86/acpi/cpufreq/amd-cppc.c | 370 +++++++++++++++++++++++++++
 xen/arch/x86/include/asm/msr-index.h |   5 +
 2 files changed, 375 insertions(+)

diff --git a/xen/arch/x86/acpi/cpufreq/amd-cppc.c b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
index 7d482140a2..bf30990c74 100644
--- a/xen/arch/x86/acpi/cpufreq/amd-cppc.c
+++ b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
@@ -14,7 +14,50 @@
 #include <xen/domain.h>
 #include <xen/init.h>
 #include <xen/param.h>
+#include <xen/percpu.h>
+#include <xen/xvmalloc.h>
 #include <acpi/cpufreq/cpufreq.h>
+#include <asm/amd.h>
+#include <asm/msr-index.h>
+
+#define amd_cppc_err(cpu, fmt, args...)                     \
+    printk(XENLOG_ERR "AMD_CPPC: CPU%u error: " fmt, cpu, ## args)
+#define amd_cppc_warn(cpu, fmt, args...)                    \
+    printk(XENLOG_WARNING "AMD_CPPC: CPU%u warning: " fmt, cpu, ## args)
+#define amd_cppc_verbose(fmt, args...)                      \
+({                                                          \
+    if ( cpufreq_verbose )                                  \
+        printk(XENLOG_DEBUG "AMD_CPPC: " fmt, ## args);     \
+})
+
+struct amd_cppc_drv_data
+{
+    const struct xen_processor_cppc *cppc_data;
+    union {
+        uint64_t raw;
+        struct {
+            unsigned int lowest_perf:8;
+            unsigned int lowest_nonlinear_perf:8;
+            unsigned int nominal_perf:8;
+            unsigned int highest_perf:8;
+            unsigned int :32;
+        };
+    } caps;
+    union {
+        uint64_t raw;
+        struct {
+            unsigned int max_perf:8;
+            unsigned int min_perf:8;
+            unsigned int des_perf:8;
+            unsigned int epp:8;
+            unsigned int :32;
+        };
+    } req;
+
+    int err;
+};
+
+static DEFINE_PER_CPU_READ_MOSTLY(struct amd_cppc_drv_data *, amd_cppc_drv_data);
 
 static bool __init amd_cppc_handle_option(const char *s, const char *end)
 {
@@ -51,10 +94,337 @@ int __init amd_cppc_cmdline_parse(const char *s, const char *e)
     return 0;
 }
 
+/*
+ * If CPPC lowest_freq and nominal_freq registers are exposed then we can
+ * use them to convert perf to freq and vice versa. The conversion is
+ * extrapolated as a linear function passing through the 2 points:
+ *  - (Low perf, Low freq)
+ *  - (Nominal perf, Nominal freq)
+ */
+static int amd_cppc_khz_to_perf(const struct amd_cppc_drv_data *data,
+                                unsigned int freq, uint8_t *perf)
+{
+    const struct xen_processor_cppc *cppc_data = data->cppc_data;
+    uint64_t mul, div;
+    int offset = 0, res;
+
+    if ( freq == (cppc_data->nominal_mhz * 1000) )
+    {
+        *perf = data->caps.nominal_perf;
+        return 0;
+    }
+
+    if ( freq == (cppc_data->lowest_mhz * 1000) )
+    {
+        *perf = data->caps.lowest_perf;
+        return 0;
+    }
+
+    if ( cppc_data->lowest_mhz && cppc_data->nominal_mhz )
+    {
+        mul = data->caps.nominal_perf - data->caps.lowest_perf;
+        div = cppc_data->nominal_mhz - cppc_data->lowest_mhz;
+        /*
+         * We don't need to convert to KHz for computing offset and can
+         * directly use nominal_mhz and lowest_mhz as the division
+         * will remove the frequency unit.
+         */
+        div = div ?: 1;
+        offset = data->caps.nominal_perf -
+                 (mul * cppc_data->nominal_mhz) / div;
+    }
+    else
+    {
+        /* Read Processor Max Speed(mhz) as anchor point */
+        mul = data->caps.highest_perf;
+        div = this_cpu(amd_max_freq_mhz);
+        if ( !div )
+            return -EINVAL;
+    }
+
+    res = offset + (mul * freq) / (div * 1000);
+    if ( res > UINT8_MAX )
+    {
+        printk_once(XENLOG_WARNING
+                    "Perf value exceeds maximum value 255: %d\n", res);
+        *perf = 0xff;
+        return 0;
+    }
+    *perf = (uint8_t)res;
+
+    return 0;
+}
+
+#define amd_get_freq(name)                                                  \
+    static int amd_get_##name##_freq(const struct amd_cppc_drv_data *data,  \
+                                     unsigned int *freq)                    \
+    {                                                                       \
+        const struct xen_processor_cppc *cppc_data = data->cppc_data;       \
+        uint64_t mul, div, res;                                             \
+                                                                            \
+        if ( cppc_data->name##_mhz )                                        \
+        {                                                                   \
+            /* Switch to khz */                                             \
+            *freq = cppc_data->name##_mhz * 1000;                           \
+            return 0;                                                       \
+        }                                                                   \
+                                                                            \
+        /* Read Processor Max Speed(mhz) as anchor point */                 \
+        mul = this_cpu(amd_max_freq_mhz);                                   \
+        if ( !mul )                                                         \
+            return -EINVAL;                                                 \
+        div = data->caps.highest_perf;                                      \
+        res = (mul * data->caps.name##_perf * 1000) / div;                  \
+        if ( res > UINT_MAX )                                               \
+        {                                                                   \
+            printk(XENLOG_ERR                                               \
+                   "Frequency exceeds maximum value UINT_MAX: %lu\n", res); \
+            return -EINVAL;                                                 \
+        }                                                                   \
+        *freq = (unsigned int)res;                                          \
+                                                                            \
+        return 0;                                                           \
+    }                                                                       \
+
+amd_get_freq(lowest);
+amd_get_freq(nominal);
+
+static int amd_get_max_freq(const struct amd_cppc_drv_data *data,
+                            unsigned int *max_freq)
+{
+    unsigned int nom_freq, boost_ratio;
+    int res;
+
+    res = amd_get_nominal_freq(data, &nom_freq);
+    if ( res )
+        return res;
+
+    boost_ratio = (unsigned int)(data->caps.highest_perf /
+                                 data->caps.nominal_perf);
+    *max_freq = nom_freq * boost_ratio;
+
+    return 0;
+}
+
+static int cf_check amd_cppc_cpufreq_verify(struct cpufreq_policy *policy)
+{
+    cpufreq_verify_within_limits(policy, policy->cpuinfo.min_freq,
+                                 policy->cpuinfo.max_freq);
+
+    return 0;
+}
+
+static void amd_cppc_write_request_msrs(void *info)
+{
+    struct amd_cppc_drv_data *data = info;
+
+    if ( wrmsr_safe(MSR_AMD_CPPC_REQ, data->req.raw) )
+    {
+        data->err = -EINVAL;
+        return;
+    }
+}
+
+static int cf_check amd_cppc_write_request(unsigned int cpu, uint8_t min_perf,
+                                           uint8_t des_perf, uint8_t max_perf)
+{
+    struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data, cpu);
+    uint64_t prev = data->req.raw;
+
+    data->req.min_perf = min_perf;
+    data->req.max_perf = max_perf;
+    data->req.des_perf = des_perf;
+
+    if ( prev == data->req.raw )
+        return 0;
+
+    data->err = 0;
+    on_selected_cpus(cpumask_of(cpu), amd_cppc_write_request_msrs, data, 1);
+
+    return data->err;
+}
+
+static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
+                                            unsigned int target_freq,
+                                            unsigned int relation)
+{
+    unsigned int cpu = policy->cpu;
+    const struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data, cpu);
+    uint8_t des_perf;
+    int res;
+
+    if ( unlikely(!target_freq) )
+        return 0;
+
+    res = amd_cppc_khz_to_perf(data, target_freq, &des_perf);
+    if ( res )
+        return res;
+
+    return amd_cppc_write_request(policy->cpu, data->caps.lowest_nonlinear_perf,
+                                  des_perf, data->caps.highest_perf);
+}
+
+static void cf_check amd_cppc_init_msrs(void *info)
+{
+    struct cpufreq_policy *policy = info;
+    struct amd_cppc_drv_data *data = this_cpu(amd_cppc_drv_data);
+    uint64_t val;
+    unsigned int min_freq, nominal_freq, max_freq;
+
+    /* Package level MSR */
+    if ( rdmsr_safe(MSR_AMD_CPPC_ENABLE, val) )
+    {
+        amd_cppc_err(policy->cpu, "rdmsr_safe(MSR_AMD_CPPC_ENABLE)\n");
+        goto err;
+    }
+
+    /*
+     * Only when Enable bit is on, the hardware will calculate the processor’s
+     * performance capabilities and initialize the performance level fields in
+     * the CPPC capability registers.
+     */
+    if ( !(val & AMD_CPPC_ENABLE) )
+    {
+        val |= AMD_CPPC_ENABLE;
+        if ( wrmsr_safe(MSR_AMD_CPPC_ENABLE, val) )
+        {
+            amd_cppc_err(policy->cpu,
+                         "wrmsr_safe(MSR_AMD_CPPC_ENABLE, %lx)\n", val);
+            goto err;
+        }
+    }
+
+    if ( rdmsr_safe(MSR_AMD_CPPC_CAP1, data->caps.raw) )
+    {
+        amd_cppc_err(policy->cpu, "rdmsr_safe(MSR_AMD_CPPC_CAP1)\n");
+        goto err;
+    }
+
+    if ( data->caps.highest_perf == 0 || data->caps.lowest_perf == 0 ||
+         data->caps.nominal_perf == 0 || data->caps.lowest_nonlinear_perf == 0 )
+    {
+        amd_cppc_err(policy->cpu,
+                     "Platform malfunction, read CPPC highest_perf: %u, lowest_perf: %u, nominal_perf: %u, lowest_nonlinear_perf: %u zero value\n",
+                     data->caps.highest_perf, data->caps.lowest_perf,
+                     data->caps.nominal_perf, data->caps.lowest_nonlinear_perf);
+        goto err;
+    }
+
+    data->err = amd_get_lowest_freq(data, &min_freq);
+    if ( data->err )
+        return;
+
+    data->err = amd_get_nominal_freq(data, &nominal_freq);
+    if ( data->err )
+        return;
+
+    data->err = amd_get_max_freq(data, &max_freq);
+    if ( data->err )
+        return;
+
+    if ( min_freq > max_freq || nominal_freq > max_freq ||
+         nominal_freq < min_freq )
+    {
+        amd_cppc_err(policy->cpu,
+                     "min_freq(%u), or max_freq(%u), or nominal_freq(%u) value is incorrect\n",
+                     min_freq, max_freq, nominal_freq);
+        goto err;
+    }
+
+    policy->min = min_freq;
+    policy->max = max_freq;
+
+    policy->cpuinfo.min_freq = min_freq;
+    policy->cpuinfo.max_freq = max_freq;
+    policy->cpuinfo.perf_freq = nominal_freq;
+    /*
+     * Set after policy->cpuinfo.perf_freq, as we are taking
+     * APERF/MPERF average frequency as current frequency.
+     */
+    policy->cur = cpufreq_driver_getavg(policy->cpu, GOV_GETAVG);
+
+    return;
+
+ err:
+    data->err = -EINVAL;
+}
+
+/*
+ * The new AMD CPPC driver is different than legacy ACPI hardware P-State,
+ * which has a finer grain frequency range between the highest and lowest
+ * frequency. And boost frequency is actually the frequency which is mapped on
+ * highest performance ratio. The legacy P0 frequency is actually mapped on
+ * nominal performance ratio.
+ */
+static void amd_cppc_boost_init(struct cpufreq_policy *policy,
+                                const struct amd_cppc_drv_data *data)
+{
+    if ( data->caps.highest_perf <= data->caps.nominal_perf )
+        return;
+
+    policy->turbo = CPUFREQ_TURBO_ENABLED;
+}
+
+static int cf_check amd_cppc_cpufreq_cpu_exit(struct cpufreq_policy *policy)
+{
+    XVFREE(per_cpu(amd_cppc_drv_data, policy->cpu));
+
+    return 0;
+}
+
+static int cf_check amd_cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
+{
+    unsigned int cpu = policy->cpu;
+    struct amd_cppc_drv_data *data;
+    const struct cpuinfo_x86 *c = cpu_data + cpu;
+
+    data = xvzalloc(struct amd_cppc_drv_data);
+    if ( !data )
+        return -ENOMEM;
+
+    data->cppc_data = &processor_pminfo[cpu]->cppc_data;
+
+    per_cpu(amd_cppc_drv_data, cpu) = data;
+
+    /* Feature CPPC is firstly introduced on Zen2 */
+    if ( c->x86 < 0x17 )
+    {
+        printk_once("Unsupported cpu family: %x\n", c->x86);
+        return -EOPNOTSUPP;
+    }
+
+    on_selected_cpus(cpumask_of(cpu), amd_cppc_init_msrs, policy, 1);
+
+    /*
+     * If the error path is taken, not only does the amd-cppc cpufreq driver
+     * fail to initialize, but we also cannot fall back to the legacy P-states
+     * driver, even if a fallback option is specified on the command line.
+     */
+    if ( data->err )
+    {
+        amd_cppc_err(cpu, "Could not initialize AMD CPPC MSR properly\n");
+        amd_cppc_cpufreq_cpu_exit(policy);
+        return -ENODEV;
+    }
+
+    policy->governor = cpufreq_opt_governor ? : CPUFREQ_DEFAULT_GOVERNOR;
+
+    amd_cppc_boost_init(policy, data);
+
+    amd_cppc_verbose("CPU %u initialized with amd-cppc passive mode\n",
+                     policy->cpu);
+
+    return 0;
+}
+
 static const struct cpufreq_driver __initconst_cf_clobber
 amd_cppc_cpufreq_driver =
 {
     .name   = XEN_AMD_CPPC_DRIVER_NAME,
+    .verify = amd_cppc_cpufreq_verify,
+    .target = amd_cppc_cpufreq_target,
+    .init   = amd_cppc_cpufreq_cpu_init,
+    .exit   = amd_cppc_cpufreq_cpu_exit,
 };
 
 int __init amd_cppc_register_driver(void)
diff --git a/xen/arch/x86/include/asm/msr-index.h b/xen/arch/x86/include/asm/msr-index.h
index 22d9e76e55..985f33eca1 100644
--- a/xen/arch/x86/include/asm/msr-index.h
+++ b/xen/arch/x86/include/asm/msr-index.h
@@ -238,6 +238,11 @@
 
 #define MSR_AMD_CSTATE_CFG                  0xc0010296U
 
+#define MSR_AMD_CPPC_CAP1                   0xc00102b0
+#define MSR_AMD_CPPC_ENABLE                 0xc00102b1
+#define  AMD_CPPC_ENABLE                    (_AC(1, ULL) <<  0)
+#define MSR_AMD_CPPC_REQ                    0xc00102b3
+
 /*
  * Legacy MSR constants in need of cleanup.  No new MSRs below this comment.
  */
-- 
2.34.1




* [PATCH v3 10/15] xen/cpufreq: only set gov NULL when cpufreq_driver.setpolicy is NULL
  2025-03-06  8:39 [PATCH v3 00/15] amd-cppc CPU Performance Scaling Driver Penny Zheng
                   ` (8 preceding siblings ...)
  2025-03-06  8:39 ` [PATCH v3 09/15] xen/x86: introduce a new amd cppc driver for cpufreq scaling Penny Zheng
@ 2025-03-06  8:39 ` Penny Zheng
  2025-03-24 16:32   ` Jan Beulich
  2025-03-06  8:39 ` [PATCH v3 11/15] xen/cpufreq: abstract Energy Performance Preference value Penny Zheng
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Penny Zheng @ 2025-03-06  8:39 UTC (permalink / raw)
  To: xen-devel; +Cc: ray.huang, Penny Zheng, Jan Beulich, Penny Zheng

From: Penny Zheng <penny.zheng@amd.com>

amd-cppc in active mode bypasses the scaling governor layer and
provides its own P-state selection algorithms in hardware. Consequently,
when it is used, the driver's ->setpolicy() callback is invoked
to register per-CPU utilization update callbacks, not the ->target()
callback.

So only when cpufreq_driver.setpolicy is NULL do we need to deliberately
set the old governor to NULL to trigger the start of the corresponding
governor.

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/drivers/cpufreq/cpufreq.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/cpufreq/cpufreq.c b/xen/drivers/cpufreq/cpufreq.c
index 792e4dc02c..8fc6e527c2 100644
--- a/xen/drivers/cpufreq/cpufreq.c
+++ b/xen/drivers/cpufreq/cpufreq.c
@@ -353,7 +353,13 @@ int cpufreq_add_cpu(unsigned int cpu)
     if (hw_all || (cpumask_weight(cpufreq_dom->map) ==
                    pmpt->domain_info.num_processors)) {
         memcpy(&new_policy, policy, sizeof(struct cpufreq_policy));
-        policy->governor = NULL;
+
+        /*
+         * Only when cpufreq_driver.setpolicy is NULL do we need to set the
+         * old governor to NULL to trigger the start of the new governor.
+         */
+        if ( cpufreq_driver.setpolicy == NULL )
+            policy->governor = NULL;
 
         cpufreq_cmdline_common_para(&new_policy);
 
-- 
2.34.1




* [PATCH v3 11/15] xen/cpufreq: abstract Energy Performance Preference value
  2025-03-06  8:39 [PATCH v3 00/15] amd-cppc CPU Performance Scaling Driver Penny Zheng
                   ` (9 preceding siblings ...)
  2025-03-06  8:39 ` [PATCH v3 10/15] xen/cpufreq: only set gov NULL when cpufreq_driver.setpolicy is NULL Penny Zheng
@ 2025-03-06  8:39 ` Penny Zheng
  2025-03-25 10:13   ` Jan Beulich
  2025-03-06  8:39 ` [PATCH v3 12/15] xen/x86: implement EPP support for the amd-cppc driver in active mode Penny Zheng
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Penny Zheng @ 2025-03-06  8:39 UTC (permalink / raw)
  To: xen-devel
  Cc: ray.huang, Penny Zheng, Jan Beulich, Andrew Cooper,
	Roger Pau Monné

Intel's hwp Energy Performance Preference value is compatible with
CPPC's Energy Performance Preference value, so this commit abstracts
the value and moves it into the common header file cpufreq.h, so that
in the future it can be used by drivers other than hwp.

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/acpi/cpufreq/hwp.c    | 10 +++-------
 xen/include/acpi/cpufreq/cpufreq.h | 10 ++++++++++
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/acpi/cpufreq/hwp.c b/xen/arch/x86/acpi/cpufreq/hwp.c
index 59b57a4cef..d5fa3d47ca 100644
--- a/xen/arch/x86/acpi/cpufreq/hwp.c
+++ b/xen/arch/x86/acpi/cpufreq/hwp.c
@@ -21,10 +21,6 @@ static bool __ro_after_init feature_hdc;
 
 static bool __ro_after_init opt_cpufreq_hdc = true;
 
-#define HWP_ENERGY_PERF_MAX_PERFORMANCE 0
-#define HWP_ENERGY_PERF_BALANCE         0x80
-#define HWP_ENERGY_PERF_MAX_POWERSAVE   0xff
-
 union hwp_request
 {
     struct
@@ -597,7 +593,7 @@ int set_hwp_para(struct cpufreq_policy *policy,
         data->minimum = data->hw.lowest;
         data->maximum = data->hw.lowest;
         data->activity_window = 0;
-        data->energy_perf = HWP_ENERGY_PERF_MAX_POWERSAVE;
+        data->energy_perf = CPPC_ENERGY_PERF_MAX_POWERSAVE;
         data->desired = 0;
         break;
 
@@ -605,7 +601,7 @@ int set_hwp_para(struct cpufreq_policy *policy,
         data->minimum = data->hw.highest;
         data->maximum = data->hw.highest;
         data->activity_window = 0;
-        data->energy_perf = HWP_ENERGY_PERF_MAX_PERFORMANCE;
+        data->energy_perf = CPPC_ENERGY_PERF_MAX_PERFORMANCE;
         data->desired = 0;
         break;
 
@@ -613,7 +609,7 @@ int set_hwp_para(struct cpufreq_policy *policy,
         data->minimum = data->hw.lowest;
         data->maximum = data->hw.highest;
         data->activity_window = 0;
-        data->energy_perf = HWP_ENERGY_PERF_BALANCE;
+        data->energy_perf = CPPC_ENERGY_PERF_BALANCE;
         data->desired = 0;
         break;
 
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index a6fb10ea27..3c2b951830 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -253,6 +253,16 @@ void cpufreq_dbs_timer_resume(void);
 
 void intel_feature_detect(struct cpufreq_policy *policy);
 
+/*
+ * If Energy Performance Preference(epp) is supported in the platform,
+ * OSPM may write a range of values from 0(performance preference)
+ * to 0xFF(energy efficiency preference) to control the platform's
+ * energy efficiency and performance optimization policies
+ */
+#define CPPC_ENERGY_PERF_MAX_PERFORMANCE 0
+#define CPPC_ENERGY_PERF_BALANCE         0x80
+#define CPPC_ENERGY_PERF_MAX_POWERSAVE   0xff
+
 int hwp_cmdline_parse(const char *s, const char *e);
 int hwp_register_driver(void);
 #ifdef CONFIG_INTEL
-- 
2.34.1




* [PATCH v3 12/15] xen/x86: implement EPP support for the amd-cppc driver in active mode
  2025-03-06  8:39 [PATCH v3 00/15] amd-cppc CPU Performance Scaling Driver Penny Zheng
                   ` (10 preceding siblings ...)
  2025-03-06  8:39 ` [PATCH v3 11/15] xen/cpufreq: abstract Energy Performance Preference value Penny Zheng
@ 2025-03-06  8:39 ` Penny Zheng
  2025-03-25 10:48   ` Jan Beulich
  2025-03-06  8:39 ` [PATCH v3 13/15] tools/xenpm: Print CPPC parameters for amd-cppc driver Penny Zheng
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Penny Zheng @ 2025-03-06  8:39 UTC (permalink / raw)
  To: xen-devel
  Cc: ray.huang, Penny Zheng, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Jan Beulich, Julien Grall, Roger Pau Monné,
	Stefano Stabellini

amd-cppc has 2 operation modes: autonomous (active) mode and
non-autonomous (passive) mode.
In active mode, the platform ignores requests made through the Desired
Performance Target register and takes into account only the values
set in the minimum, maximum and energy performance preference (EPP)
registers.
The EPP is used in the CCLK DPM controller to drive the frequency
at which a core operates during short periods of activity.
The SOC EPP targets are configured on a scale from 0 to 255, where 0
represents maximum performance and 255 represents maximum efficiency.
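
As an illustration, the sketch below packs the four 8-bit fields of a
CPPC request value. The layout (max_perf in bits 0-7, min_perf in bits
8-15, des_perf in bits 16-23, EPP in bits 24-31) is inferred from the
bitfield order in struct amd_cppc_drv_data and from the
"(val >> 24) & 0xFF" extraction in this patch; cppc_req_pack() and
cppc_req_epp() are hypothetical helper names, not part of the driver.

```c
#include <stdint.h>

/*
 * Hypothetical helper: build a raw CPPC request value from its fields,
 * assuming the field layout described above.
 */
static uint64_t cppc_req_pack(uint8_t max_perf, uint8_t min_perf,
                              uint8_t des_perf, uint8_t epp)
{
    return (uint64_t)max_perf |
           ((uint64_t)min_perf << 8) |
           ((uint64_t)des_perf << 16) |
           ((uint64_t)epp << 24);
}

/* Hypothetical helper: extract the EPP field from a raw request value. */
static uint8_t cppc_req_epp(uint64_t raw)
{
    return (raw >> 24) & 0xff;
}
```

In active mode the desired-performance field is left at 0, so a request
with max_perf = 0xff, min_perf = 0x10 and a balanced EPP of 0x80 packs
to 0x800010ff.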

This commit implements a new AMD CPU frequency driver, `amd-cppc-epp`,
for active mode. It also introduces an `active` tag for users to explicitly
select active mode, and a new variable `opt_active_mode` to keep track of
which mode is currently enabled.

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
---
v1 -> v2:
- Remove redundant epp_mode
- Remove pointless initializer
- Define sole caller read_epp_init_once and epp_init value to read
pre-defined BIOS epp value only once
- Combine the commit "xen/cpufreq: introduce policy type when
cpufreq_driver->setpolicy exists"
---
v2 -> v3:
- Combined with commit "x86/cpufreq: add "cpufreq=amd-cppc,active" para"
- Refactor doc about "active mode"
- Change opt_cpufreq_active to opt_active_mode
- Let caller pass epp_init when unspecified to allow the function parameter
to be of uint8_t
- Make epp_init per-cpu value
---
 docs/misc/xen-command-line.pandoc    |   8 +-
 xen/arch/x86/acpi/cpufreq/amd-cppc.c | 119 +++++++++++++++++++++++++--
 xen/drivers/cpufreq/utility.c        |  11 +++
 xen/include/acpi/cpufreq/cpufreq.h   |  12 +++
 xen/include/public/sysctl.h          |   1 +
 5 files changed, 145 insertions(+), 6 deletions(-)

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index b3c3ca2377..19094070b3 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -515,7 +515,7 @@ If set, force use of the performance counters for oprofile, rather than detectin
 available support.
 
 ### cpufreq
-> `= none | {{ <boolean> | xen } { [:[powersave|performance|ondemand|userspace][,[<maxfreq>]][,[<minfreq>]]] } [,verbose]} | dom0-kernel | hwp[:[<hdc>][,verbose]] | amd-cppc[:[verbose]]`
+> `= none | {{ <boolean> | xen } { [:[powersave|performance|ondemand|userspace][,[<maxfreq>]][,[<minfreq>]]] } [,verbose]} | dom0-kernel | hwp[:[<hdc>][,verbose]] | amd-cppc[:[active][,verbose]]`
 
 > Default: `xen`
 
@@ -537,6 +537,12 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
 * `amd-cppc` selects ACPI Collaborative Performance and Power Control (CPPC)
   on supported AMD hardware to provide finer grained frequency control
   mechanism. The default is disabled.
+* `active` enables the amd-cppc driver in active (autonomous) mode. In this
+  mode, users can write to the energy performance preference register to tell
+  hardware whether to bias toward performance or energy efficiency. The
+  built-in CPPC power algorithm then tracks the runtime workload and adjusts
+  core frequency automatically according to power supply, thermal, core
+  voltage and other hardware conditions.
 
 User could use `;`-separated options to support universal options which they
 would like to try on any agnostic platform, *but* under priority order, like
diff --git a/xen/arch/x86/acpi/cpufreq/amd-cppc.c b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
index bf30990c74..606bb648b3 100644
--- a/xen/arch/x86/acpi/cpufreq/amd-cppc.c
+++ b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
@@ -30,6 +30,9 @@
         printk(XENLOG_DEBUG "AMD_CPPC: " fmt, ## args);     \
 })
 
+static bool __ro_after_init opt_active_mode;
+static DEFINE_PER_CPU_READ_MOSTLY(uint8_t, epp_init);
+
 struct amd_cppc_drv_data
 {
     const struct xen_processor_cppc *cppc_data;
@@ -70,6 +73,13 @@ static bool __init amd_cppc_handle_option(const char *s, const char *end)
         return true;
     }
 
+    ret = parse_boolean("active", s, end);
+    if ( ret >= 0 )
+    {
+        opt_active_mode = ret;
+        return true;
+    }
+
     return false;
 }
 
@@ -226,14 +236,19 @@ static void amd_cppc_write_request_msrs(void *info)
 }
 
 static int cf_check amd_cppc_write_request(unsigned int cpu, uint8_t min_perf,
-                                           uint8_t des_perf, uint8_t max_perf)
+                                           uint8_t des_perf, uint8_t max_perf,
+                                           uint8_t epp)
 {
     struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data, cpu);
     uint64_t prev = data->req.raw;
 
     data->req.min_perf = min_perf;
     data->req.max_perf = max_perf;
-    data->req.des_perf = des_perf;
+    if ( !opt_active_mode )
+        data->req.des_perf = des_perf;
+    else
+        data->req.des_perf = 0;
+    data->req.epp = epp;
 
     if ( prev == data->req.raw )
         return 0;
@@ -261,7 +276,20 @@ static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
         return res;
 
     return amd_cppc_write_request(policy->cpu, data->caps.lowest_nonlinear_perf,
-                                  des_perf, data->caps.highest_perf);
+                                  des_perf, data->caps.highest_perf,
+                                  /* Pre-defined BIOS value for passive mode */
+                                  per_cpu(epp_init, policy->cpu));
+}
+
+static int read_epp_init(void)
+{
+    uint64_t val;
+
+    if ( rdmsr_safe(MSR_AMD_CPPC_REQ, val) )
+        return -EINVAL;
+    this_cpu(epp_init) = (val >> 24) & 0xFF;
+
+    return 0;
 }
 
 static void cf_check amd_cppc_init_msrs(void *info)
@@ -343,6 +371,8 @@ static void cf_check amd_cppc_init_msrs(void *info)
      */
     policy->cur = cpufreq_driver_getavg(policy->cpu, GOV_GETAVG);
 
+    data->err = read_epp_init();
+
     return;
 
  err:
@@ -372,7 +402,7 @@ static int cf_check amd_cppc_cpufreq_cpu_exit(struct cpufreq_policy *policy)
     return 0;
 }
 
-static int cf_check amd_cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
+static int amd_cppc_cpufreq_init_perf(struct cpufreq_policy *policy)
 {
     unsigned int cpu = policy->cpu;
     struct amd_cppc_drv_data *data;
@@ -411,12 +441,78 @@ static int cf_check amd_cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
 
     amd_cppc_boost_init(policy, data);
 
+    return 0;
+}
+
+static int cf_check amd_cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
+{
+    int ret;
+
+    ret = amd_cppc_cpufreq_init_perf(policy);
+    if ( ret )
+        return ret;
+
     amd_cppc_verbose("CPU %u initialized with amd-cppc passive mode\n",
                      policy->cpu);
 
     return 0;
 }
 
+static int cf_check amd_cppc_epp_cpu_init(struct cpufreq_policy *policy)
+{
+    int ret;
+
+    ret = amd_cppc_cpufreq_init_perf(policy);
+    if ( ret )
+        return ret;
+
+    policy->policy = cpufreq_parse_policy(policy->governor);
+
+    amd_cppc_verbose("CPU %u initialized with amd-cppc active mode\n", policy->cpu);
+
+    return 0;
+}
+
+static int amd_cppc_epp_update_limit(const struct cpufreq_policy *policy)
+{
+    const struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data,
+                                                    policy->cpu);
+    uint8_t max_perf, min_perf, epp;
+
+    /* Initial min/max values for CPPC Performance Controls Register */
+    /*
+     * Continuous CPPC performance scale in active mode is [lowest_perf,
+     * highest_perf]
+     */
+    max_perf = data->caps.highest_perf;
+    min_perf = data->caps.lowest_perf;
+
+    epp = per_cpu(epp_init, policy->cpu);
+    if ( policy->policy == CPUFREQ_POLICY_PERFORMANCE )
+    {
+        /* Force the epp value to be zero for performance policy */
+        epp = CPPC_ENERGY_PERF_MAX_PERFORMANCE;
+        min_perf = max_perf;
+    }
+    else if ( policy->policy == CPUFREQ_POLICY_POWERSAVE )
+        /* Force the epp value to be 0xff for powersave policy */
+        /*
+         * Setting max_perf = min_perf = lowest_perf would put the
+         * CPU cores in idle, so only the epp value is changed here.
+         */
+        epp = CPPC_ENERGY_PERF_MAX_POWERSAVE;
+
+    return amd_cppc_write_request(policy->cpu, min_perf,
+                                  /* des_perf = 0 for epp mode */
+                                  0,
+                                  max_perf, epp);
+}
+
+static int cf_check amd_cppc_epp_set_policy(struct cpufreq_policy *policy)
+{
+    return amd_cppc_epp_update_limit(policy);
+}
+
 static const struct cpufreq_driver __initconst_cf_clobber
 amd_cppc_cpufreq_driver =
 {
@@ -427,6 +523,16 @@ amd_cppc_cpufreq_driver =
     .exit   = amd_cppc_cpufreq_cpu_exit,
 };
 
+static const struct cpufreq_driver __initconst_cf_clobber
+amd_cppc_epp_driver =
+{
+    .name       = XEN_AMD_CPPC_EPP_DRIVER_NAME,
+    .verify     = amd_cppc_cpufreq_verify,
+    .setpolicy  = amd_cppc_epp_set_policy,
+    .init       = amd_cppc_epp_cpu_init,
+    .exit       = amd_cppc_cpufreq_cpu_exit,
+};
+
 int __init amd_cppc_register_driver(void)
 {
     int ret;
@@ -437,7 +543,10 @@ int __init amd_cppc_register_driver(void)
         return -ENODEV;
     }
 
-    ret = cpufreq_register_driver(&amd_cppc_cpufreq_driver);
+    if ( opt_active_mode )
+        ret = cpufreq_register_driver(&amd_cppc_epp_driver);
+    else
+        ret = cpufreq_register_driver(&amd_cppc_cpufreq_driver);
     if ( ret )
         return ret;
 
diff --git a/xen/drivers/cpufreq/utility.c b/xen/drivers/cpufreq/utility.c
index f1fd2fdbce..abde499d40 100644
--- a/xen/drivers/cpufreq/utility.c
+++ b/xen/drivers/cpufreq/utility.c
@@ -491,3 +491,14 @@ int __cpufreq_set_policy(struct cpufreq_policy *data,
 
     return __cpufreq_governor(data, CPUFREQ_GOV_LIMITS);
 }
+
+unsigned int cpufreq_parse_policy(const struct cpufreq_governor *gov)
+{
+    if ( !strncasecmp(gov->name, "performance", CPUFREQ_NAME_LEN) )
+        return CPUFREQ_POLICY_PERFORMANCE;
+
+    if ( !strncasecmp(gov->name, "powersave", CPUFREQ_NAME_LEN) )
+        return CPUFREQ_POLICY_POWERSAVE;
+
+    return CPUFREQ_POLICY_UNKNOWN;
+}
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 3c2b951830..7c36634d40 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -83,6 +83,7 @@ struct cpufreq_policy {
     int8_t              turbo;  /* tristate flag: 0 for unsupported
                                  * -1 for disable, 1 for enabled
                                  * See CPUFREQ_TURBO_* below for defines */
+    unsigned int        policy; /* CPUFREQ_POLICY_* */
 };
 DECLARE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_policy);
 
@@ -133,6 +134,17 @@ extern int cpufreq_register_governor(struct cpufreq_governor *governor);
 extern struct cpufreq_governor *__find_governor(const char *governor);
 #define CPUFREQ_DEFAULT_GOVERNOR &cpufreq_gov_dbs
 
+#define CPUFREQ_POLICY_UNKNOWN      0
+/*
+ * If cpufreq_driver->target() exists, the ->governor decides what frequency
+ * within the limits is used. If cpufreq_driver->setpolicy() exists, these
+ * two generic policies are available:
+ */
+#define CPUFREQ_POLICY_POWERSAVE    1
+#define CPUFREQ_POLICY_PERFORMANCE  2
+
+unsigned int cpufreq_parse_policy(const struct cpufreq_governor *gov);
+
 /* pass a target to the cpufreq driver */
 extern int __cpufreq_driver_target(struct cpufreq_policy *policy,
                                    unsigned int target_freq,
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index 42997252ef..fa431fd983 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -424,6 +424,7 @@ struct xen_set_cppc_para {
 };
 
 #define XEN_AMD_CPPC_DRIVER_NAME "amd-cppc"
+#define XEN_AMD_CPPC_EPP_DRIVER_NAME "amd-cppc-epp"
 #define XEN_HWP_DRIVER_NAME "hwp"
 
 /*
-- 
2.34.1




* [PATCH v3 13/15] tools/xenpm: Print CPPC parameters for amd-cppc driver
  2025-03-06  8:39 [PATCH v3 00/15] amd-cppc CPU Performance Scaling Driver Penny Zheng
                   ` (11 preceding siblings ...)
  2025-03-06  8:39 ` [PATCH v3 12/15] xen/x86: implement EPP support for the amd-cppc driver in active mode Penny Zheng
@ 2025-03-06  8:39 ` Penny Zheng
  2025-03-06  8:39 ` [PATCH v3 14/15] xen/xenpm: Adapt cpu frequency monitor in xenpm Penny Zheng
  2025-03-06  8:39 ` [PATCH v3 15/15] xen/cpufreq: Adapt SET/GET_CPUFREQ_CPPC xen_sysctl_pm_op for amd-cppc driver Penny Zheng
  14 siblings, 0 replies; 60+ messages in thread
From: Penny Zheng @ 2025-03-06  8:39 UTC (permalink / raw)
  To: xen-devel
  Cc: ray.huang, Penny Zheng, Anthony PERARD, Penny Zheng, Jan Beulich

From: Penny Zheng <penny.zheng@amd.com>

HWP, amd-cppc and amd-cppc-epp are all implementations of ACPI CPPC
(Collaborative Processor Performance Control), so introduce a cppc_mode
flag to print CPPC-related parameters.

HWP and amd-cppc-epp are both governor-less drivers, so introduce a
hw_auto flag to bypass governor-related output.

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
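For illustration only, the mapping from driver name to the two flags
described above can be sketched as below. This is a minimal standalone
sketch: struct driver_flags and classify_driver() are hypothetical names
that do not exist in xenpm; only the three driver name strings are taken
from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Driver names as defined in xen/include/public/sysctl.h by this series. */
#define XEN_AMD_CPPC_DRIVER_NAME     "amd-cppc"
#define XEN_AMD_CPPC_EPP_DRIVER_NAME "amd-cppc-epp"
#define XEN_HWP_DRIVER_NAME          "hwp"

/* Hypothetical helper type, not part of xenpm. */
struct driver_flags {
    bool cppc_mode; /* driver implements CPPC: print CPPC parameters */
    bool hw_auto;   /* governor-less driver: skip governor output */
};

/* Hypothetical helper: derive both flags from the scaling driver name. */
static struct driver_flags classify_driver(const char *name)
{
    struct driver_flags f = { false, false };

    assert(name != NULL);

    if ( !strcmp(name, XEN_HWP_DRIVER_NAME) ||
         !strcmp(name, XEN_AMD_CPPC_EPP_DRIVER_NAME) )
        f.cppc_mode = f.hw_auto = true;
    else if ( !strcmp(name, XEN_AMD_CPPC_DRIVER_NAME) )
        f.cppc_mode = true;

    return f;
}
```

With this split, "amd-cppc" (passive mode) prints both the CPPC block and
the governor block, while "hwp" and "amd-cppc-epp" print the CPPC block
only.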
 tools/misc/xenpm.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index 336d246346..a7aeaea35e 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -790,9 +790,18 @@ static unsigned int calculate_activity_window(const xc_cppc_para_t *cppc,
 /* print out parameters about cpu frequency */
 static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
 {
-    bool hwp = strcmp(p_cpufreq->scaling_driver, XEN_HWP_DRIVER_NAME) == 0;
+    bool cppc_mode = false, hw_auto = false;
     int i;
 
+    if ( !strcmp(p_cpufreq->scaling_driver, XEN_HWP_DRIVER_NAME) ||
+         !strcmp(p_cpufreq->scaling_driver, XEN_AMD_CPPC_DRIVER_NAME) ||
+         !strcmp(p_cpufreq->scaling_driver, XEN_AMD_CPPC_EPP_DRIVER_NAME) )
+        cppc_mode = true;
+
+    if ( !strcmp(p_cpufreq->scaling_driver, XEN_HWP_DRIVER_NAME) ||
+         !strcmp(p_cpufreq->scaling_driver, XEN_AMD_CPPC_EPP_DRIVER_NAME) )
+        hw_auto = true;
+
     printf("cpu id               : %d\n", cpuid);
 
     printf("affected_cpus        :");
@@ -800,7 +809,7 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
         printf(" %d", p_cpufreq->affected_cpus[i]);
     printf("\n");
 
-    if ( hwp )
+    if ( hw_auto )
         printf("cpuinfo frequency    : base [%"PRIu32"] max [%"PRIu32"]\n",
                p_cpufreq->cpuinfo_min_freq,
                p_cpufreq->cpuinfo_max_freq);
@@ -812,7 +821,7 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
 
     printf("scaling_driver       : %s\n", p_cpufreq->scaling_driver);
 
-    if ( hwp )
+    if ( cppc_mode )
     {
         const xc_cppc_para_t *cppc = &p_cpufreq->u.cppc_para;
 
@@ -838,7 +847,8 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
                cppc->desired,
                cppc->desired ? "" : " hw autonomous");
     }
-    else
+
+    if ( !hw_auto )
     {
         printf("scaling_avail_gov    : %s\n",
                p_cpufreq->scaling_available_governors);
-- 
2.34.1




* [PATCH v3 14/15] xen/xenpm: Adapt cpu frequency monitor in xenpm
  2025-03-06  8:39 [PATCH v3 00/15] amd-cppc CPU Performance Scaling Driver Penny Zheng
                   ` (12 preceding siblings ...)
  2025-03-06  8:39 ` [PATCH v3 13/15] tools/xenpm: Print CPPC parameters for amd-cppc driver Penny Zheng
@ 2025-03-06  8:39 ` Penny Zheng
  2025-03-25 11:26   ` Jan Beulich
  2025-03-06  8:39 ` [PATCH v3 15/15] xen/cpufreq: Adapt SET/GET_CPUFREQ_CPPC xen_sysctl_pm_op for amd-cppc driver Penny Zheng
  14 siblings, 1 reply; 60+ messages in thread
From: Penny Zheng @ 2025-03-06  8:39 UTC (permalink / raw)
  To: xen-devel
  Cc: ray.huang, Penny Zheng, Anthony PERARD, Juergen Gross,
	Jan Beulich

Make `xenpm get-cpufreq-para/set-cpufreq-para` available in CPPC mode.
In `xenpm get-cpufreq-para <cpuid>`, scaling_available_frequencies only
has a meaningful value when the cpufreq driver uses legacy P-states, so
loosen the "has_num" condition to bypass the scaling_available_frequencies
check in CPPC mode.

Also, in `xenpm get-cpufreq-para start`, monitoring of the average frequency
shall not depend on the existence of legacy P-states.

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
---
v2 -> v3:
- new commit
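
A rough standalone sketch of the loosened check described above, for
illustration only. struct para_request, has_num() and buffers_ok() are
hypothetical stand-ins for the relevant fields of struct
xc_get_cpufreq_para and the inline checks in xc_get_cpufreq_para(); they
are not libxenctrl API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the counts and buffers a caller passes in. */
struct para_request {
    unsigned int cpu_num, freq_num, gov_num;
    const void *affected_cpus;
    const void *available_frequencies;
    const void *available_governors;
};

/* freq_num no longer takes part: it is legitimately 0 in CPPC mode. */
static bool has_num(const struct para_request *p)
{
    assert(p != NULL);
    return p->cpu_num && p->gov_num;
}

/* A buffer is only required when its element count is non-zero. */
static bool buffers_ok(const struct para_request *p)
{
    assert(p != NULL);

    if ( !p->affected_cpus )
        return false;
    if ( p->freq_num && !p->available_frequencies )
        return false;
    if ( p->gov_num && !p->available_governors )
        return false;

    return true;
}
```

A CPPC-mode request with freq_num == 0 and no frequency buffer passes both
checks, whereas a legacy P-state request that reports frequencies but
supplies no buffer is still rejected.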
---
 tools/libs/ctrl/xc_pm.c   | 12 +++++++-----
 tools/misc/xenpm.c        |  5 +++--
 xen/drivers/acpi/pmstat.c | 30 +++++++++++++++++-------------
 3 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/tools/libs/ctrl/xc_pm.c b/tools/libs/ctrl/xc_pm.c
index b27b45c3dc..d843b79d6d 100644
--- a/tools/libs/ctrl/xc_pm.c
+++ b/tools/libs/ctrl/xc_pm.c
@@ -214,13 +214,12 @@ int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
 			 user_para->gov_num * CPUFREQ_NAME_LEN * sizeof(char), XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
 
     bool has_num = user_para->cpu_num &&
-                     user_para->freq_num &&
                      user_para->gov_num;
 
     if ( has_num )
     {
         if ( (!user_para->affected_cpus)                    ||
-             (!user_para->scaling_available_frequencies)    ||
+             (user_para->freq_num && !user_para->scaling_available_frequencies)    ||
              (user_para->gov_num && !user_para->scaling_available_governors) )
         {
             errno = EINVAL;
@@ -228,14 +227,16 @@ int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
         }
         if ( xc_hypercall_bounce_pre(xch, affected_cpus) )
             goto unlock_1;
-        if ( xc_hypercall_bounce_pre(xch, scaling_available_frequencies) )
+        if ( user_para->freq_num &&
+             xc_hypercall_bounce_pre(xch, scaling_available_frequencies) )
             goto unlock_2;
         if ( user_para->gov_num &&
              xc_hypercall_bounce_pre(xch, scaling_available_governors) )
             goto unlock_3;
 
         set_xen_guest_handle(sys_para->affected_cpus, affected_cpus);
-        set_xen_guest_handle(sys_para->scaling_available_frequencies, scaling_available_frequencies);
+        if ( user_para->freq_num )
+            set_xen_guest_handle(sys_para->scaling_available_frequencies, scaling_available_frequencies);
         if ( user_para->gov_num )
             set_xen_guest_handle(sys_para->scaling_available_governors,
                                  scaling_available_governors);
@@ -301,7 +302,8 @@ unlock_4:
     if ( user_para->gov_num )
         xc_hypercall_bounce_post(xch, scaling_available_governors);
 unlock_3:
-    xc_hypercall_bounce_post(xch, scaling_available_frequencies);
+    if ( user_para->freq_num )
+        xc_hypercall_bounce_post(xch, scaling_available_frequencies);
 unlock_2:
     xc_hypercall_bounce_post(xch, affected_cpus);
 unlock_1:
diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index a7aeaea35e..a521800504 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -539,7 +539,7 @@ static void signal_int_handler(int signo)
                         res / 1000000UL, 100UL * res / (double)sum_px[i]);
             }
         }
-        if ( px_cap && avgfreq[i] )
+        if ( avgfreq[i] )
             printf("  Avg freq\t%d\tKHz\n", avgfreq[i]);
     }
 
@@ -926,7 +926,8 @@ static int show_cpufreq_para_by_cpuid(xc_interface *xc_handle, int cpuid)
             ret = -ENOMEM;
             goto out;
         }
-        if (!(p_cpufreq->scaling_available_frequencies =
+        if (p_cpufreq->freq_num &&
+            !(p_cpufreq->scaling_available_frequencies =
               malloc(p_cpufreq->freq_num * sizeof(uint32_t))))
         {
             fprintf(stderr,
diff --git a/xen/drivers/acpi/pmstat.c b/xen/drivers/acpi/pmstat.c
index c8e00766a6..7f432be761 100644
--- a/xen/drivers/acpi/pmstat.c
+++ b/xen/drivers/acpi/pmstat.c
@@ -202,7 +202,7 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
     pmpt = processor_pminfo[op->cpuid];
     policy = per_cpu(cpufreq_cpu_policy, op->cpuid);
 
-    if ( !pmpt || !pmpt->perf.states ||
+    if ( !pmpt || ((pmpt->init & XEN_PX_INIT) && !pmpt->perf.states) ||
          !policy || !policy->governor )
         return -EINVAL;
 
@@ -229,17 +229,20 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
     if ( ret )
         return ret;
 
-    if ( !(scaling_available_frequencies =
-           xzalloc_array(uint32_t, op->u.get_para.freq_num)) )
-        return -ENOMEM;
-    for ( i = 0; i < op->u.get_para.freq_num; i++ )
-        scaling_available_frequencies[i] =
-                        pmpt->perf.states[i].core_frequency * 1000;
-    ret = copy_to_guest(op->u.get_para.scaling_available_frequencies,
-                   scaling_available_frequencies, op->u.get_para.freq_num);
-    xfree(scaling_available_frequencies);
-    if ( ret )
-        return ret;
+    if ( op->u.get_para.freq_num )
+    {
+        if ( !(scaling_available_frequencies =
+               xzalloc_array(uint32_t, op->u.get_para.freq_num)) )
+            return -ENOMEM;
+        for ( i = 0; i < op->u.get_para.freq_num; i++ )
+            scaling_available_frequencies[i] =
+                            pmpt->perf.states[i].core_frequency * 1000;
+        ret = copy_to_guest(op->u.get_para.scaling_available_frequencies,
+                    scaling_available_frequencies, op->u.get_para.freq_num);
+        xfree(scaling_available_frequencies);
+        if ( ret )
+            return ret;
+    }
 
     op->u.get_para.cpuinfo_cur_freq =
         cpufreq_driver.get ? alternative_call(cpufreq_driver.get, op->cpuid)
@@ -465,7 +468,8 @@ int do_pm_op(struct xen_sysctl_pm_op *op)
     switch ( op->cmd & PM_PARA_CATEGORY_MASK )
     {
     case CPUFREQ_PARA:
-        if ( !(xen_processor_pmbits & XEN_PROCESSOR_PM_PX) )
+        if ( !(xen_processor_pmbits & (XEN_PROCESSOR_PM_PX |
+                                       XEN_PROCESSOR_PM_CPPC)) )
             return -ENODEV;
         if ( !pmpt || !(pmpt->init & (XEN_PX_INIT | XEN_CPPC_INIT)) )
             return -EINVAL;
-- 
2.34.1




* [PATCH v3 15/15] xen/cpufreq: Adapt SET/GET_CPUFREQ_CPPC xen_sysctl_pm_op for amd-cppc driver
  2025-03-06  8:39 [PATCH v3 00/15] amd-cppc CPU Performance Scaling Driver Penny Zheng
                   ` (13 preceding siblings ...)
  2025-03-06  8:39 ` [PATCH v3 14/15] xen/xenpm: Adapt cpu frequency monitor in xenpm Penny Zheng
@ 2025-03-06  8:39 ` Penny Zheng
  2025-03-25 16:59   ` Jan Beulich
  14 siblings, 1 reply; 60+ messages in thread
From: Penny Zheng @ 2025-03-06  8:39 UTC (permalink / raw)
  To: xen-devel
  Cc: ray.huang, Penny Zheng, Jan Beulich, Andrew Cooper,
	Roger Pau Monné

Introduce helpers set_amd_cppc_para() and get_amd_cppc_para() to
SET/GET CPPC-related parameters for the amd-cppc/amd-cppc-epp drivers.

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
---
v1 -> v2:
- Give the variable des_perf an initializer of 0
- Use the strncmp()s directly in the if()
---
 xen/arch/x86/acpi/cpufreq/amd-cppc.c | 124 +++++++++++++++++++++++++++
 xen/drivers/acpi/pmstat.c            |  20 ++++-
 xen/include/acpi/cpufreq/cpufreq.h   |   5 ++
 3 files changed, 145 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/acpi/cpufreq/amd-cppc.c b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
index 606bb648b3..28c13b09c8 100644
--- a/xen/arch/x86/acpi/cpufreq/amd-cppc.c
+++ b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
@@ -32,6 +32,7 @@
 
 static bool __ro_after_init opt_active_mode;
 static DEFINE_PER_CPU_READ_MOSTLY(uint8_t, epp_init);
+static bool __ro_after_init amd_cppc_in_use;
 
 struct amd_cppc_drv_data
 {
@@ -513,6 +514,123 @@ static int cf_check amd_cppc_epp_set_policy(struct cpufreq_policy *policy)
     return amd_cppc_epp_update_limit(policy);
 }
 
+int get_amd_cppc_para(unsigned int cpu,
+                      struct xen_cppc_para *cppc_para)
+{
+    const struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data, cpu);
+
+    if ( data == NULL )
+        return -ENODATA;
+
+    cppc_para->features         = 0;
+    cppc_para->lowest           = data->caps.lowest_perf;
+    cppc_para->lowest_nonlinear = data->caps.lowest_nonlinear_perf;
+    cppc_para->nominal          = data->caps.nominal_perf;
+    cppc_para->highest          = data->caps.highest_perf;
+    cppc_para->minimum          = data->req.min_perf;
+    cppc_para->maximum          = data->req.max_perf;
+    cppc_para->desired          = data->req.des_perf;
+    cppc_para->energy_perf      = data->req.epp;
+
+    return 0;
+}
+
+int set_amd_cppc_para(const struct cpufreq_policy *policy,
+                      const struct xen_set_cppc_para *set_cppc)
+{
+    unsigned int cpu = policy->cpu;
+    struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data, cpu);
+    uint8_t max_perf, min_perf, des_perf = 0, epp;
+
+    if ( data == NULL )
+        return -ENOENT;
+
+    /* Validate all parameters - Disallow reserved bits. */
+    if ( set_cppc->minimum > UINT8_MAX || set_cppc->maximum > UINT8_MAX ||
+         set_cppc->desired > UINT8_MAX || set_cppc->energy_perf > UINT8_MAX )
+        return -EINVAL;
+
+    /* Only allow values if params bit is set. */
+    if ( (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED) &&
+          set_cppc->desired) ||
+         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MINIMUM) &&
+          set_cppc->minimum) ||
+         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MAXIMUM) &&
+          set_cppc->maximum) ||
+         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ENERGY_PERF) &&
+          set_cppc->energy_perf) )
+        return -EINVAL;
+
+    /* Activity window not supported in MSR */
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ACT_WINDOW )
+        return -EOPNOTSUPP;
+
+    /* Return if there is nothing to do. */
+    if ( set_cppc->set_params == 0 )
+        return 0;
+
+    epp = per_cpu(epp_init, cpu);
+    /* Apply presets */
+    /*
+     * XEN_SYSCTL_CPPC_SET_PRESET_POWERSAVE/PERFORMANCE/BALANCE are
+     * for amd-cppc in active mode, where min_perf may be set to lowest_perf,
+     * covering the T-state range of performance levels, while
+     * XEN_SYSCTL_CPPC_SET_PRESET_NONE is for amd-cppc in passive mode, which
+     * depends on the governor to do performance scaling; min_perf is set to
+     * lowest_nonlinear_perf to ensure performance stays in the P-state range.
+     */
+    switch ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_PRESET_MASK )
+    {
+    case XEN_SYSCTL_CPPC_SET_PRESET_POWERSAVE:
+        if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED )
+            return -EINVAL;
+        min_perf = data->caps.lowest_perf;
+        max_perf = data->caps.highest_perf;
+        epp = CPPC_ENERGY_PERF_MAX_POWERSAVE;
+        break;
+
+    case XEN_SYSCTL_CPPC_SET_PRESET_PERFORMANCE:
+        if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED )
+            return -EINVAL;
+        min_perf = data->caps.highest_perf;
+        max_perf = data->caps.highest_perf;
+        epp = CPPC_ENERGY_PERF_MAX_PERFORMANCE;
+        break;
+
+    case XEN_SYSCTL_CPPC_SET_PRESET_BALANCE:
+        if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED )
+            return -EINVAL;
+        min_perf = data->caps.lowest_perf;
+        max_perf = data->caps.highest_perf;
+        epp = CPPC_ENERGY_PERF_BALANCE;
+        break;
+
+    case XEN_SYSCTL_CPPC_SET_PRESET_NONE:
+        min_perf = data->caps.lowest_nonlinear_perf;
+        max_perf = data->caps.highest_perf;
+        break;
+
+    default:
+        return -EINVAL;
+    }
+
+    /* Further customize presets if needed */
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MINIMUM )
+        min_perf = set_cppc->minimum;
+
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MAXIMUM )
+        max_perf = set_cppc->maximum;
+
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ENERGY_PERF )
+        epp = set_cppc->energy_perf;
+
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED )
+        des_perf = set_cppc->desired;
+
+    return amd_cppc_write_request(cpu, min_perf, des_perf, max_perf, epp);
+}
+
+
 static const struct cpufreq_driver __initconst_cf_clobber
 amd_cppc_cpufreq_driver =
 {
@@ -533,6 +651,11 @@ amd_cppc_epp_driver =
     .exit       = amd_cppc_cpufreq_cpu_exit,
 };
 
+bool amd_cppc_active(void)
+{
+    return amd_cppc_in_use;
+}
+
 int __init amd_cppc_register_driver(void)
 {
     int ret;
@@ -552,6 +675,7 @@ int __init amd_cppc_register_driver(void)
 
     /* Remove possible fallback option */
     xen_processor_pmbits &= ~XEN_PROCESSOR_PM_PX;
+    amd_cppc_in_use = true;
 
     return ret;
 }
diff --git a/xen/drivers/acpi/pmstat.c b/xen/drivers/acpi/pmstat.c
index 7f432be761..9c96020d69 100644
--- a/xen/drivers/acpi/pmstat.c
+++ b/xen/drivers/acpi/pmstat.c
@@ -261,7 +261,16 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
          !strncmp(op->u.get_para.scaling_driver, XEN_HWP_DRIVER_NAME,
                   CPUFREQ_NAME_LEN) )
         ret = get_hwp_para(policy->cpu, &op->u.get_para.u.cppc_para);
-    else
+    else if ( !strncmp(op->u.get_para.scaling_driver, XEN_AMD_CPPC_DRIVER_NAME,
+                       CPUFREQ_NAME_LEN) ||
+              !strncmp(op->u.get_para.scaling_driver, XEN_AMD_CPPC_EPP_DRIVER_NAME,
+                       CPUFREQ_NAME_LEN) )
+        ret = get_amd_cppc_para(policy->cpu, &op->u.get_para.u.cppc_para);
+
+    if ( strncmp(op->u.get_para.scaling_driver, XEN_HWP_DRIVER_NAME,
+                 CPUFREQ_NAME_LEN) &&
+         strncmp(op->u.get_para.scaling_driver, XEN_AMD_CPPC_EPP_DRIVER_NAME,
+                 CPUFREQ_NAME_LEN) )
     {
         if ( !(scaling_available_governors =
                xzalloc_array(char, gov_num * CPUFREQ_NAME_LEN)) )
@@ -417,10 +426,13 @@ static int set_cpufreq_cppc(struct xen_sysctl_pm_op *op)
     if ( !policy || !policy->governor )
         return -ENOENT;
 
-    if ( !hwp_active() )
-        return -EOPNOTSUPP;
+    if ( hwp_active() )
+        return set_hwp_para(policy, &op->u.set_cppc);
+
+    if ( amd_cppc_active() )
+        return set_amd_cppc_para(policy, &op->u.set_cppc);
 
-    return set_hwp_para(policy, &op->u.set_cppc);
+    return -EOPNOTSUPP;
 }
 
 int do_pm_op(struct xen_sysctl_pm_op *op)
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 7c36634d40..0a2eb2a26f 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -292,5 +292,10 @@ int acpi_cpufreq_register(void);
 
 int amd_cppc_cmdline_parse(const char *s, const char *e);
 int amd_cppc_register_driver(void);
+bool amd_cppc_active(void);
+int get_amd_cppc_para(unsigned int cpu,
+                      struct xen_cppc_para *cppc_para);
+int set_amd_cppc_para(const struct cpufreq_policy *policy,
+                      const struct xen_set_cppc_para *set_cppc);
 
 #endif /* __XEN_CPUFREQ_PM_H__ */
-- 
2.34.1




* Re: [PATCH v3 01/15] xen/cpufreq: introduces XEN_PM_PSD for solely delivery of _PSD
  2025-03-06  8:39 ` [PATCH v3 01/15] xen/cpufreq: introduces XEN_PM_PSD for solely delivery of _PSD Penny Zheng
@ 2025-03-24 14:08   ` Jan Beulich
  2025-04-01  3:25     ` Penny, Zheng
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2025-03-24 14:08 UTC (permalink / raw)
  To: Penny Zheng
  Cc: ray.huang, Andrew Cooper, Roger Pau Monné, Anthony PERARD,
	Michal Orzel, Julien Grall, Stefano Stabellini, xen-devel

On 06.03.2025 09:39, Penny Zheng wrote:
> --- a/xen/include/public/platform.h
> +++ b/xen/include/public/platform.h
> @@ -363,12 +363,12 @@ DEFINE_XEN_GUEST_HANDLE(xenpf_getidletime_t);
>  #define XEN_PM_PX   1
>  #define XEN_PM_TX   2
>  #define XEN_PM_PDC  3
> +#define XEN_PM_PSD  4
>  
>  /* Px sub info type */
>  #define XEN_PX_PCT   1
>  #define XEN_PX_PSS   2
>  #define XEN_PX_PPC   4
> -#define XEN_PX_PSD   8
>  
>  struct xen_power_register {
>      uint32_t     space_id;
> @@ -439,6 +439,7 @@ struct xen_psd_package {
>      uint64_t coord_type;
>      uint64_t num_processors;
>  };
> +typedef struct xen_psd_package xen_psd_package_t;
>  
>  struct xen_processor_performance {
>      uint32_t flags;     /* flag for Px sub info type */
> @@ -447,12 +448,6 @@ struct xen_processor_performance {
>      struct xen_pct_register status_register;
>      uint32_t state_count;     /* total available performance states */
>      XEN_GUEST_HANDLE(xen_processor_px_t) states;
> -    struct xen_psd_package domain_info;
> -    /* Coordination type of this processor */
> -#define XEN_CPUPERF_SHARED_TYPE_HW   1 /* HW does needed coordination */
> -#define XEN_CPUPERF_SHARED_TYPE_ALL  2 /* All dependent CPUs should set freq */
> -#define XEN_CPUPERF_SHARED_TYPE_ANY  3 /* Freq can be set from any dependent CPU */
> -    uint32_t shared_type;
>  };
>  typedef struct xen_processor_performance xen_processor_performance_t;
>  DEFINE_XEN_GUEST_HANDLE(xen_processor_performance_t);
> @@ -463,9 +458,15 @@ struct xenpf_set_processor_pminfo {
>      uint32_t type;  /* {XEN_PM_CX, XEN_PM_PX} */
>      union {
>          struct xen_processor_power          power;/* Cx: _CST/_CSD */
> -        struct xen_processor_performance    perf; /* Px: _PPC/_PCT/_PSS/_PSD */
> +        xen_psd_package_t                   domain_info; /* _PSD */
> +        struct xen_processor_performance    perf; /* Px: _PPC/_PCT/_PSS/ */
>          XEN_GUEST_HANDLE(uint32)            pdc;  /* _PDC */
>      } u;
> +    /* Coordination type of this processor */
> +#define XEN_CPUPERF_SHARED_TYPE_HW   1 /* HW does needed coordination */
> +#define XEN_CPUPERF_SHARED_TYPE_ALL  2 /* All dependent CPUs should set freq */
> +#define XEN_CPUPERF_SHARED_TYPE_ANY  3 /* Freq can be set from any dependent CPU */
> +    uint32_t shared_type;
>  };
>  typedef struct xenpf_set_processor_pminfo xenpf_set_processor_pminfo_t;
>  DEFINE_XEN_GUEST_HANDLE(xenpf_set_processor_pminfo_t);

With this change to stable hypercall structures, how is an older Dom0 kernel
going to be able to properly upload the necessary data? IOW: No, you can't
alter existing stable hypercall structures like this.
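
To illustrate the concern with a heavily simplified sketch (all struct
names and field lists below are stand-ins invented for this example, not
the real public headers): once _PSD data and shared_type move out of the
Px payload, the offset at which shared_type is found changes, so a caller
built against the old layout writes it where the new layout no longer
looks.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct xen_psd_package. */
struct psd_pkg {
    uint64_t num_entries, revision, domain, coord_type, num_processors;
};

/* "Old" layout: _PSD data and shared_type travel inside the Px payload. */
struct old_perf {
    uint32_t flags;
    uint32_t state_count;
    struct psd_pkg domain_info;
    uint32_t shared_type;
};

struct old_pminfo {
    uint32_t id;
    uint32_t type;
    union {
        struct old_perf perf;
    } u;
};

/* "New" layout: _PSD is its own union member, shared_type follows the union. */
struct new_perf {
    uint32_t flags;
    uint32_t state_count;
};

struct new_pminfo {
    uint32_t id;
    uint32_t type;
    union {
        struct psd_pkg domain_info;
        struct new_perf perf;
    } u;
    uint32_t shared_type;
};

/* Byte offset of shared_type as seen by a caller using the old layout. */
static size_t old_shared_type_offset(void)
{
    return offsetof(struct old_pminfo, u) +
           offsetof(struct old_perf, shared_type);
}

/* Byte offset of shared_type as expected by the new layout. */
static size_t new_shared_type_offset(void)
{
    return offsetof(struct new_pminfo, shared_type);
}
```

The two offsets differ, which is the kind of silent mismatch a stable
interface has to rule out.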

Jan



* Re: [PATCH v3 02/15] xen/x86: introduce new sub-hypercall to propagate CPPC data
  2025-03-06  8:39 ` [PATCH v3 02/15] xen/x86: introduce new sub-hypercall to propagate CPPC data Penny Zheng
@ 2025-03-24 14:28   ` Jan Beulich
  2025-03-25  4:12     ` Penny, Zheng
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2025-03-24 14:28 UTC (permalink / raw)
  To: Penny Zheng
  Cc: ray.huang, Andrew Cooper, Roger Pau Monné, Anthony PERARD,
	Michal Orzel, Julien Grall, Stefano Stabellini, xen-devel

On 06.03.2025 09:39, Penny Zheng wrote:
> In order to provide backward compatibility with existing governors
> that represent performance as frequencies, like ondemand, the _CPC
> table can optionally provide processor frequency range values, Lowest
> frequency and Norminal frequency, to let OS use Lowest Frequency/
> Performance and Nominal Frequency/Performance as anchor points to
> create linear mapping of CPPC abstract performance to CPU frequency.
> 
> As Xen is uncapable of parsing the ACPI dynamic table, this commit
> introduces a new sub-hypercall to propagate required CPPC data from
> dom0 kernel.

Nit: Here and ...

> If the platform supports CPPC, the _CPC object must exist under all
> processor objects. That is, Xen is not expected to support mixed mode
> (CPPC & legacy PSS, _PCT, _PPC) operation, either advanced CPPC, or legacy
> P-states.
> 
> This commit also introduces a new flag XEN_PM_CPPC to reflect processor
> initialised in CPPC mode.

... here and elsewhere: Please avoid "this commit", "this patch", or
anything alike in patch descriptions.

Apart from this I'm not sure how useful review here is going to be, as there
apparently is a dependency on the problematic aspect in patch 1. Therefore
I'll give only a few independent comments.

> @@ -606,6 +616,41 @@ int set_psd_pminfo(uint32_t acpi_id, uint32_t shared_type,
>      return ret;
>  }
>  
> +int set_cppc_pminfo(uint32_t acpi_id,
> +                    const struct xen_processor_cppc *cppc_data)
> +{
> +    int ret = 0, cpuid;
> +    struct processor_pminfo *pm_info;
> +
> +    cpuid = get_cpu_id(acpi_id);
> +    if ( cpuid < 0 || !cppc_data )
> +    {
> +        ret = -EINVAL;
> +        goto out;
> +    }
> +    if ( cpufreq_verbose )
> +        printk("Set CPU acpi_id(%d) cpuid(%d) CPPC State info:\n",
> +               acpi_id, cpuid);

Nit: %d isn't appropriate for a variable/parameter of type uint32_t. In turn I
don't think the parameter needs to be of a fixed-width type; unsigned int will
be quite fine there, I expect. See ./CODING_STYLE.
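
As a small standalone illustration of the format-specifier point (plain
snprintf() stands in for printk() here, and format_cppc_banner() is a
hypothetical helper, not Xen code): with an unsigned int parameter and %u,
ACPI IDs above INT_MAX are printed as the value that was passed in.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: acpi_id is unsigned, so it is formatted with %u. */
static int format_cppc_banner(char *buf, size_t len,
                              unsigned int acpi_id, int cpuid)
{
    assert(buf != NULL);

    return snprintf(buf, len,
                    "Set CPU acpi_id(%u) cpuid(%d) CPPC State info:",
                    acpi_id, cpuid);
}
```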

> +    pm_info = processor_pminfo[cpuid];
> +    /* Must already allocated in set_psd_pminfo */
> +    if ( !pm_info )
> +    {
> +        ret = -EINVAL;
> +        goto out;
> +    }
> +    pm_info->cppc_data = *cppc_data;
> +
> +    if ( cpufreq_verbose )
> +        print_CPPC(&pm_info->cppc_data);
> +
> +    pm_info->init = XEN_CPPC_INIT;

That is - whichever Dom0 invoked last will have data recorded, and the other
effectively is discarded? I think a warning (perhaps a one-time one) is minimally
needed to diagnose the case where one type of data replaces the other.

With this it also remains unclear to me how fallback to the legacy driver is
intended to be working. Both taken together are a strong suggestion that important
information on the model that is being implemented is missing from the description.

> @@ -27,8 +28,6 @@ struct processor_performance {
>      struct xen_pct_register status_register;
>      uint32_t state_count;
>      struct xen_processor_px *states;
> -
> -    uint32_t init;
>  };
>  
>  struct processor_pminfo {
> @@ -37,6 +36,9 @@ struct processor_pminfo {
>      struct xen_psd_package domain_info;
>      uint32_t shared_type;
>      struct processor_performance    perf;
> +    struct xen_processor_cppc cppc_data;
> +
> +    uint32_t init;
>  };

This moving of the "init" field and the mechanical changes coming with it
can likely be split out to a separate patch? Provided of course the movement
is still wanted/needed with patch 1 re-worked or dropped.

Jan



* Re: [PATCH v3 03/15] xen/cpufreq: refactor cmdline "cpufreq=xxx"
  2025-03-06  8:39 ` [PATCH v3 03/15] xen/cpufreq: refactor cmdline "cpufreq=xxx" Penny Zheng
@ 2025-03-24 15:00   ` Jan Beulich
  2025-03-26  7:20     ` Penny, Zheng
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2025-03-24 15:00 UTC (permalink / raw)
  To: Penny Zheng
  Cc: ray.huang, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Julien Grall, Roger Pau Monné, Stefano Stabellini, xen-devel

On 06.03.2025 09:39, Penny Zheng wrote:
> This commit includes the following modification:
> - Introduce helper function cpufreq_cmdline_parse_xen and
> cpufreq_cmdline_parse_hwp to tidy the different parsing path
> - Add helper cpufreq_opts_contain to ignore user redundant setting,
> like "cpufreq=hwp;hwp;xen"
> - Doc refinement

See my earlier comment as to wording to avoid. In descriptions and comments
it would also be nice if function names could be followed by () (and array
names then be followed by []) to clearly identify the nature of such
identifiers.

> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -535,7 +535,8 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
>    processor to autonomously force physical package components into idle state.
>    The default is enabled, but the option only applies when `hwp` is enabled.
>  
> -There is also support for `;`-separated fallback options:
> +User could use `;`-separated options to support universal options which they
> +would like to try on any agnostic platform, *but* under priority order, like
>  `cpufreq=hwp;xen,verbose`.  This first tries `hwp` and falls back to `xen` if
>  unavailable.  Note: The `verbose` suboption is handled globally.  Setting it
>  for either the primary or fallback option applies to both irrespective of where

What does "support" here mean? I fear I can't even suggest what else to use,
as I don't follow what additional information you mean to add here. Is a
change here really needed?

> --- a/xen/drivers/cpufreq/cpufreq.c
> +++ b/xen/drivers/cpufreq/cpufreq.c
> @@ -71,6 +71,46 @@ unsigned int __initdata cpufreq_xen_cnt = 1;
>  
>  static int __init cpufreq_cmdline_parse(const char *s, const char *e);
>  
> +static bool __init cpufreq_opts_contain(enum cpufreq_xen_opt option)
> +{
> +    unsigned int count = cpufreq_xen_cnt;
> +
> +    while ( count )
> +    {
> +        if ( cpufreq_xen_opts[--count] == option )
> +            return true;
> +    }
> +
> +    return false;
> +}
> +
> +static int __init cpufreq_cmdline_parse_xen(const char *arg, const char *end)
> +{
> +    int ret = 0;
> +
> +    xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
> +    cpufreq_controller = FREQCTL_xen;
> +    cpufreq_xen_opts[cpufreq_xen_cnt++] = CPUFREQ_xen;
> +    ret = 0;

ret was already set to 0 by the initializer.

> +    if ( arg[0] && arg[1] )
> +        ret = cpufreq_cmdline_parse(arg + 1, end);
> +
> +    return ret;
> +}
> +
> +static int __init cpufreq_cmdline_parse_hwp(const char *arg, const char *end)
> +{
> +    int ret = 0;
> +
> +    xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
> +    cpufreq_controller = FREQCTL_xen;
> +    cpufreq_xen_opts[cpufreq_xen_cnt++] = CPUFREQ_hwp;
> +    if ( arg[0] && arg[1] )
> +        ret = hwp_cmdline_parse(arg + 1, end);
> +
> +    return ret;
> +}

For both of the helpers may I suggest s/parse/process/ or some such
("handle" might be another possible term to use), as they themselves
don't do any parsing?

In the end I'm also not entirely convinced that we need these two almost
identical helpers (with a 3rd likely appearing in a later patch).

> @@ -112,25 +152,13 @@ static int __init cf_check setup_cpufreq_option(const char *str)
>          if ( cpufreq_xen_cnt == ARRAY_SIZE(cpufreq_xen_opts) )
>              return -E2BIG;
>  
> -        if ( choice > 0 || !cmdline_strcmp(str, "xen") )
> -        {
> -            xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
> -            cpufreq_controller = FREQCTL_xen;
> -            cpufreq_xen_opts[cpufreq_xen_cnt++] = CPUFREQ_xen;
> -            ret = 0;
> -            if ( arg[0] && arg[1] )
> -                ret = cpufreq_cmdline_parse(arg + 1, end);
> -        }
> +        if ( (choice > 0 || !cmdline_strcmp(str, "xen")) &&
> +             !cpufreq_opts_contain(CPUFREQ_xen) )
> +            ret = cpufreq_cmdline_parse_xen(arg, end);
>          else if ( IS_ENABLED(CONFIG_INTEL) && choice < 0 &&
> -                  !cmdline_strcmp(str, "hwp") )
> -        {
> -            xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
> -            cpufreq_controller = FREQCTL_xen;
> -            cpufreq_xen_opts[cpufreq_xen_cnt++] = CPUFREQ_hwp;
> -            ret = 0;
> -            if ( arg[0] && arg[1] )
> -                ret = hwp_cmdline_parse(arg + 1, end);
> -        }
> +                  !cmdline_strcmp(str, "hwp") &&
> +                  !cpufreq_opts_contain(CPUFREQ_hwp) )
> +            ret = cpufreq_cmdline_parse_hwp(arg, end);
>          else
>              ret = -EINVAL;

Hmm, if I'm not mistaken the example "cpufreq=hwp;hwp;xen" would lead us
to this -EINVAL then. That's not quite "ignore" as the description says.

Jan
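
To illustrate the last remark: a minimal, stand-alone sketch of handling that really ignores a repeated option instead of falling through to -EINVAL. Everything here (enum opt, opts[], handle_token()) is a hypothetical stand-in for the Xen code, not the patch itself; the point is only that the duplicate check wants to sit after the keyword match rather than be part of the match condition.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for cpufreq_xen_opts[] / cpufreq_xen_cnt. */
enum opt { OPT_none, OPT_xen, OPT_hwp };

#define MAX_OPTS 2
static enum opt opts[MAX_OPTS];
static unsigned int opt_cnt;

static bool opts_contain(enum opt o)
{
    for ( unsigned int i = 0; i < opt_cnt; i++ )
        if ( opts[i] == o )
            return true;

    return false;
}

/* Handle one ';'-separated token; a repeated token is accepted as a no-op. */
static int handle_token(const char *tok)
{
    enum opt o;

    if ( !strcmp(tok, "xen") )
        o = OPT_xen;
    else if ( !strcmp(tok, "hwp") )
        o = OPT_hwp;
    else
        return -EINVAL;         /* genuinely unknown keyword */

    if ( opts_contain(o) )
        return 0;               /* duplicate: ignore rather than reject */

    if ( opt_cnt == MAX_OPTS )
        return -E2BIG;

    opts[opt_cnt++] = o;

    return 0;
}
```

With this ordering, "hwp;hwp;xen" records hwp and xen once each, and only an unknown keyword yields -EINVAL.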



* Re: [PATCH v3 04/15] xen/cpufreq: move XEN_PROCESSOR_PM_xxx to internal header
  2025-03-06  8:39 ` [PATCH v3 04/15] xen/cpufreq: move XEN_PROCESSOR_PM_xxx to internal header Penny Zheng
@ 2025-03-24 15:11   ` Jan Beulich
  2025-03-26  7:48     ` Penny, Zheng
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2025-03-24 15:11 UTC (permalink / raw)
  To: Penny Zheng
  Cc: ray.huang, Andrew Cooper, Roger Pau Monné, Anthony PERARD,
	Michal Orzel, Julien Grall, Stefano Stabellini, xen-devel

On 06.03.2025 09:39, Penny Zheng wrote:
> XEN_PROCESSOR_PM_xxx are used to set xen_processor_pmbits only, which is
> a Xen-internal variable only. Although PV Dom0 passed these bits in si->flags,
> they haven't been used anywhere.

Please be careful with "not used anywhere". See e.g.
https://xenbits.xen.org/gitweb/?p=legacy/linux-2.6.18-xen.git;a=blob;f=arch/i386/kernel/acpi/processor_extcntl_xen.c;h=eb6a53e9572c137da505a7d4970b1a5b7e1c522d;hb=HEAD#l193

> So this commit moves XEN_PROCESSOR_PM_xxx back to internal header
> "acpi/cpufreq/processor_perf.h"

Essentially you're again altering the stable public ABI in a way that's not
acceptable.

Jan



* Re: [PATCH v3 05/15] xen/x86: introduce "cpufreq=amd-cppc" xen cmdline
  2025-03-06  8:39 ` [PATCH v3 05/15] xen/x86: introduce "cpufreq=amd-cppc" xen cmdline Penny Zheng
@ 2025-03-24 15:26   ` Jan Beulich
  2025-03-26  8:35     ` Penny, Zheng
  2025-03-25 10:00   ` Jan Beulich
  1 sibling, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2025-03-24 15:26 UTC (permalink / raw)
  To: Penny Zheng
  Cc: ray.huang, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Julien Grall, Roger Pau Monné, Stefano Stabellini, xen-devel

On 06.03.2025 09:39, Penny Zheng wrote:
> @@ -514,5 +515,14 @@ acpi_cpufreq_driver = {
>  
>  int __init acpi_cpufreq_register(void)
>  {
> -    return cpufreq_register_driver(&acpi_cpufreq_driver);
> +    int ret;
> +
> +    ret = cpufreq_register_driver(&acpi_cpufreq_driver);
> +    if ( ret )
> +        return ret;
> +
> +    if ( IS_ENABLED(CONFIG_AMD) )
> +        xen_processor_pmbits &= ~XEN_PROCESSOR_PM_CPPC;

What's the purpose of the if() here?

> @@ -157,7 +161,35 @@ static int __init cf_check cpufreq_driver_init(void)
>  
>          case X86_VENDOR_AMD:
>          case X86_VENDOR_HYGON:
> -            ret = IS_ENABLED(CONFIG_AMD) ? powernow_register_driver() : -ENODEV;
> +            if ( !IS_ENABLED(CONFIG_AMD) )
> +            {
> +                ret = -ENODEV;
> +                break;
> +            }
> +            ret = -ENOENT;
> +
> +            for ( unsigned int i = 0; i < cpufreq_xen_cnt; i++ )
> +            {
> +                switch ( cpufreq_xen_opts[i] )
> +                {
> +                case CPUFREQ_xen:
> +                    ret = powernow_register_driver();
> +                    break;
> +                case CPUFREQ_amd_cppc:
> +                    ret = amd_cppc_register_driver();
> +                    break;
> +                case CPUFREQ_none:
> +                    ret = 0;
> +                    break;
> +                default:
> +                    printk(XENLOG_WARNING
> +                           "Unsupported cpufreq driver for vendor AMD\n");

What about Hygon?

> --- a/xen/include/acpi/cpufreq/cpufreq.h
> +++ b/xen/include/acpi/cpufreq/cpufreq.h
> @@ -28,6 +28,7 @@ enum cpufreq_xen_opt {
>      CPUFREQ_none,
>      CPUFREQ_xen,
>      CPUFREQ_hwp,
> +    CPUFREQ_amd_cppc,
>  };
>  extern enum cpufreq_xen_opt cpufreq_xen_opts[2];

I'm pretty sure I pointed out before that this array needs to grow, now that
you add a 3rd kind of handling.

Jan



* Re: [PATCH v3 06/15] xen/cpufreq: disable px statistic info in amd-cppc mode
  2025-03-06  8:39 ` [PATCH v3 06/15] xen/cpufreq: disable px statistic info in amd-cppc mode Penny Zheng
@ 2025-03-24 15:34   ` Jan Beulich
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Beulich @ 2025-03-24 15:34 UTC (permalink / raw)
  To: Penny Zheng; +Cc: ray.huang, xen-devel

On 06.03.2025 09:39, Penny Zheng wrote:
> Bypass construction and deconstruction for px statistic info(
> cpufreq_statistic_init and cpufreq_statistic_exit) in cpufreq
> CPPC mode.

You say what you do, but not why.

> --- a/xen/drivers/cpufreq/utility.c
> +++ b/xen/drivers/cpufreq/utility.c
> @@ -98,6 +98,9 @@ int cpufreq_statistic_init(unsigned int cpu)
>      if ( !pmpt )
>          return -EINVAL;
>  
> +    if ( !(pmpt->init & XEN_PX_INIT) )
> +        return 0;

I understand this is needed if statistics really are of no interest for this
driver (which needs to be clarified in the description). However, ...

> @@ -147,8 +150,12 @@ int cpufreq_statistic_init(unsigned int cpu)
>  void cpufreq_statistic_exit(unsigned int cpu)
>  {
>      struct pm_px *pxpt;
> +    const struct processor_pminfo *pmpt = processor_pminfo[cpu];
>      spinlock_t *cpufreq_statistic_lock = &per_cpu(cpufreq_statistic_lock, cpu);
>  
> +    if ( !(pmpt->init & XEN_PX_INIT) )
> +        return;
> +
>      spin_lock(cpufreq_statistic_lock);
>  
>      pxpt = per_cpu(cpufreq_statistic_data, cpu);

... why's this needed, when below here there already is:

    if (!pxpt) {
        spin_unlock(cpufreq_statistic_lock);
        return;
    }

?

Jan



* Re: [PATCH v3 07/15] xen/cpufreq: fix core frequency calculation for AMD Family 1Ah CPUs
  2025-03-06  8:39 ` [PATCH v3 07/15] xen/cpufreq: fix core frequency calculation for AMD Family 1Ah CPUs Penny Zheng
@ 2025-03-24 15:47   ` Jan Beulich
  2025-03-25 10:55     ` Nicola Vetrini
  2025-03-26  9:54     ` Penny, Zheng
  0 siblings, 2 replies; 60+ messages in thread
From: Jan Beulich @ 2025-03-24 15:47 UTC (permalink / raw)
  To: Penny Zheng
  Cc: ray.huang, Andrew Cooper, Roger Pau Monné, xen-devel,
	Nicola Vetrini

On 06.03.2025 09:39, Penny Zheng wrote:
> This commit fixes core frequency calculation for AMD Family 1Ah CPUs, due to
> a change in the PStateDef MSR layout in AMD Family 1Ah+.
> In AMD Family 1Ah+, Core current operating frequency in MHz is calculated as
> follows:

Why 1Ah+? In the code you correctly limit to just 1Ah.

> --- a/xen/arch/x86/cpu/amd.c
> +++ b/xen/arch/x86/cpu/amd.c
> @@ -572,12 +572,24 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
>                                                            : c->cpu_core_id);
>  }
>  
> +static uint64_t amd_parse_freq(const struct cpuinfo_x86 *c, uint64_t value)
> +{
> +	ASSERT(c->x86 <= 0x1A);
> +
> +	if (c->x86 < 0x17)
> +		return (((value & 0x3f) + 0x10) * 100) >> ((value >> 6) & 7);
> +	else if (c->x86 <= 0x19)
> +		return ((value & 0xff) * 25 * 8) / ((value >> 8) & 0x3f);
> +	else
> +		return (value & 0xfff) * 5;
> +}

Could I talk you into omitting the unnecessary "else" in cases like this one?
(This may also make sense to express as switch().)

> @@ -658,19 +670,20 @@ void amd_log_freq(const struct cpuinfo_x86 *c)
>  	if (!(lo >> 63))
>  		return;
>  
> -#define FREQ(v) (c->x86 < 0x17 ? ((((v) & 0x3f) + 0x10) * 100) >> (((v) >> 6) & 7) \
> -		                     : (((v) & 0xff) * 25 * 8) / (((v) >> 8) & 0x3f))
>  	if (idx && idx < h &&
>  	    !rdmsr_safe(0xC0010064 + idx, val) && (val >> 63) &&
>  	    !rdmsr_safe(0xC0010064, hi) && (hi >> 63))
>  		printk("CPU%u: %lu (%lu ... %lu) MHz\n",
> -		       smp_processor_id(), FREQ(val), FREQ(lo), FREQ(hi));
> +		       smp_processor_id(),
> +		       amd_parse_freq(c, val),
> +		       amd_parse_freq(c, lo), amd_parse_freq(c, hi));

I fear Misra won't like multiple function calls to evaluate the parameters
to pass to another function. Iirc smp_processor_id() has a special exception,
so that's okay here. This may be possible to alleviate by marking the new
helper pure or even const (see gcc doc as to caveats with passing pointers
to const functions). Cc-ing Nicola for possible clarification or correction.

Jan
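
A rough sketch of the two suggestions combined (no "else" after return, plus a const attribute), written as a stand-alone function. As an assumption made purely for illustration, it takes the family number by value instead of a struct cpuinfo_x86 pointer, which sidesteps the pointer caveat for const functions; parse_freq() is a hypothetical name, the formulas are the ones from the quoted patch, and whether the attribute actually addresses the Misra concern is the open question above.

```c
#include <stdint.h>

/*
 * Hypothetical stand-alone variant of amd_parse_freq(): the result depends
 * on the two by-value arguments only, so the GNU "const" attribute applies.
 */
__attribute__((const))
static uint64_t parse_freq(unsigned int family, uint64_t value)
{
    /* Pre-Fam17h layout: 6-bit FID, 3-bit DID used as a shift count. */
    if ( family < 0x17 )
        return (((value & 0x3f) + 0x10) * 100) >> ((value >> 6) & 7);

    /* Fam17h..19h layout: 8-bit FID, 6-bit divisor (non-zero when enabled). */
    if ( family <= 0x19 )
        return ((value & 0xff) * 25 * 8) / ((value >> 8) & 0x3f);

    /* Fam1Ah layout: 12-bit FID in units of 5 MHz. */
    return (value & 0xfff) * 5;
}
```

For example, a Fam1Ah PStateDef value with FID 0x320 evaluates to 800 * 5 = 4000 MHz.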



* Re: [PATCH v3 08/15] xen/amd: export processor max frequency value
  2025-03-06  8:39 ` [PATCH v3 08/15] xen/amd: export processor max frequency value Penny Zheng
@ 2025-03-24 15:52   ` Jan Beulich
  2025-03-27  8:38     ` Penny, Zheng
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2025-03-24 15:52 UTC (permalink / raw)
  To: Penny Zheng; +Cc: ray.huang, Andrew Cooper, Roger Pau Monné, xen-devel

On 06.03.2025 09:39, Penny Zheng wrote:
> --- a/xen/arch/x86/cpu/amd.c
> +++ b/xen/arch/x86/cpu/amd.c
> @@ -56,6 +56,8 @@ bool __initdata amd_virt_spec_ctrl;
>  
>  static bool __read_mostly fam17_c6_disabled;
>  
> +DEFINE_PER_CPU_READ_MOSTLY(uint64_t, amd_max_freq_mhz);
> +
>  static inline int rdmsr_amd_safe(unsigned int msr, unsigned int *lo,
>  				 unsigned int *hi)
>  {
> @@ -681,9 +683,15 @@ void amd_log_freq(const struct cpuinfo_x86 *c)
>  		printk("CPU%u: %lu ... %lu MHz\n",
>  		       smp_processor_id(),
>  		       amd_parse_freq(c, lo), amd_parse_freq(c, hi));
> -	else
> +	else {
>  		printk("CPU%u: %lu MHz\n", smp_processor_id(),
>  		       amd_parse_freq(c, lo));
> +		return;
> +	}
> +
> +	/* Store max frequency for amd-cppc cpufreq driver */
> +	if (hi >> 63)
> +		this_cpu(amd_max_freq_mhz) = amd_parse_freq(c, hi);
>  }

As before - typically only the BSP will make it here, due to the conditional
at the top of the function. IOW you'll observe zeros in the per-CPU data for
all other CPUs.

> --- a/xen/arch/x86/include/asm/amd.h
> +++ b/xen/arch/x86/include/asm/amd.h
> @@ -174,4 +174,5 @@ bool amd_setup_legacy_ssbd(void);
>  void amd_set_legacy_ssbd(bool enable);
>  void amd_set_cpuid_user_dis(bool enable);
>  
> +DECLARE_PER_CPU(uint64_t, amd_max_freq_mhz);
>  #endif /* __AMD_H__ */

I'm also pretty sure that I did ask before to maintain a blank line ahead of
the #endif. Please may I ask that you thoroughly address earlier review
comments, before submitting a new version?

Jan



* Re: [PATCH v3 10/15] xen/cpufreq: only set gov NULL when cpufreq_driver.setpolicy is NULL
  2025-03-06  8:39 ` [PATCH v3 10/15] xen/cpufreq: only set gov NULL when cpufreq_driver.setpolicy is NULL Penny Zheng
@ 2025-03-24 16:32   ` Jan Beulich
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Beulich @ 2025-03-24 16:32 UTC (permalink / raw)
  To: Penny Zheng; +Cc: ray.huang, xen-devel

On 06.03.2025 09:39, Penny Zheng wrote:
> --- a/xen/drivers/cpufreq/cpufreq.c
> +++ b/xen/drivers/cpufreq/cpufreq.c
> @@ -353,7 +353,13 @@ int cpufreq_add_cpu(unsigned int cpu)
>      if (hw_all || (cpumask_weight(cpufreq_dom->map) ==
>                     pmpt->domain_info.num_processors)) {
>          memcpy(&new_policy, policy, sizeof(struct cpufreq_policy));
> -        policy->governor = NULL;
> +
> +       /*
> +        * Only when cpufreq_driver.setpolicy == NULL, we need to deliberately
> +        * set old gov as NULL to trigger the according gov starting.
> +        */
> +       if ( cpufreq_driver.setpolicy == NULL )
> +            policy->governor = NULL;
>  
>          cpufreq_cmdline_common_para(&new_policy);

Indentation looks off-by-1 here.

Also (I may have asked this before, but couldn't find an indication in this
submission, including in the cover letter): Is this independent of all earlier
patches in the series, and could hence go in right away?

Jan



* RE: [PATCH v3 02/15] xen/x86: introduce new sub-hypercall to propagate CPPC data
  2025-03-24 14:28   ` Jan Beulich
@ 2025-03-25  4:12     ` Penny, Zheng
  2025-03-25  7:53       ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Penny, Zheng @ 2025-03-25  4:12 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Huang, Ray, Andrew Cooper, Roger Pau Monné, Anthony PERARD,
	Orzel, Michal, Julien Grall, Stefano Stabellini,
	xen-devel@lists.xenproject.org

[Public]

Hi,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Monday, March 24, 2025 10:28 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Roger Pau Monné <roger.pau@citrix.com>;
> Anthony PERARD <anthony.perard@vates.tech>; Orzel, Michal
> <Michal.Orzel@amd.com>; Julien Grall <julien@xen.org>; Stefano Stabellini
> <sstabellini@kernel.org>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v3 02/15] xen/x86: introduce new sub-hypercall to propagate
> CPPC data
>
> On 06.03.2025 09:39, Penny Zheng wrote:
> > +    pm_info = processor_pminfo[cpuid];
> > +    /* Must already allocated in set_psd_pminfo */
> > +    if ( !pm_info )
> > +    {
> > +        ret = -EINVAL;
> > +        goto out;
> > +    }
> > +    pm_info->cppc_data = *cppc_data;
> > +
> > +    if ( cpufreq_verbose )
> > +        print_CPPC(&pm_info->cppc_data);
> > +
> > +    pm_info->init = XEN_CPPC_INIT;
>
> That is - whichever Dom0 invoked last will have data recorded, and the other
> effectively is discarded? I think a warning (perhaps a one-time one) is minimally
> needed to diagnose the case where one type of data replaces the other.
>

In the v2 discussion, we discussed that either set_px_pminfo or set_cppc_pminfo shall be invoked,
which means either PX data is recorded, or CPPC data is recorded.
The current logic is that the cpufreq cmdline handling sets the XEN_PROCESSOR_PM_PX/CPPC
flags to reflect user preference. If the user specifies a fallback option, like "cpufreq=amd-cppc,xen", both
XEN_PROCESSOR_PM_PX | XEN_PROCESSOR_PM_CPPC are set in the beginning.
Later, in the cpufreq driver registration logic, only one driver can be registered. If amd-cppc
registers successfully, it clears the XEN_PROCESSOR_PM_PX flag bit.
If it fails to register, the fallback scheme kicks in: we try legacy P-states and, in the meantime,
clear XEN_PROCESSOR_PM_CPPC.
We are trying to make XEN_PROCESSOR_PM_PX and XEN_PROCESSOR_PM_CPPC mutually exclusive
after driver registration, which ensures that only one of set_px_pminfo or set_cppc_pminfo
is taken at runtime.
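
The flag handling described above can be sketched roughly as follows. This is a stand-alone illustration under stated assumptions, not the series' code: the bit values of PM_PX and PM_CPPC, the helper name resolve_pmbits(), and the fixed "try amd-cppc first, then legacy" order are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical bit values standing in for XEN_PROCESSOR_PM_PX / _CPPC. */
#define PM_PX   (1u << 1)
#define PM_CPPC (1u << 3)

/*
 * Given the bits requested on the command line and whether each driver
 * would register successfully, return the bits left after registration.
 * At most one of PM_PX / PM_CPPC remains set.
 */
static unsigned int resolve_pmbits(unsigned int bits, bool cppc_ok,
                                   bool legacy_ok)
{
    if ( bits & PM_CPPC )
    {
        if ( cppc_ok )
            return bits & ~PM_PX;   /* amd-cppc registered: drop PX */

        bits &= ~PM_CPPC;           /* amd-cppc failed: fall back */
    }

    if ( (bits & PM_PX) && !legacy_ok )
        bits &= ~PM_PX;             /* legacy driver failed as well */

    return bits;
}
```

With both bits requested, a working amd-cppc driver leaves only PM_CPPC set, and a failing one leaves only PM_PX set, provided the legacy driver registers.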

> With this it also remains unclear to me how fallback to the legacy driver is intended
> to be working. Both taken together are a strong suggestion that important
> information on the model that is being implemented is missing from the description.
>
> > @@ -27,8 +28,6 @@ struct processor_performance {
> >      struct xen_pct_register status_register;
> >      uint32_t state_count;
> >      struct xen_processor_px *states;
> > -
> > -    uint32_t init;
> >  };
> >
> >  struct processor_pminfo {
> > @@ -37,6 +36,9 @@ struct processor_pminfo {
> >      struct xen_psd_package domain_info;
> >      uint32_t shared_type;
> >      struct processor_performance    perf;
> > +    struct xen_processor_cppc cppc_data;
> > +
> > +    uint32_t init;
> >  };
>
> This moving of the "init" field and the mechanical changes coming with it can likely
> be split out to a separate patch? Provided of course the movement is still
> wanted/needed with patch 1 re-worked or dropped.
>
> Jan


* Re: [PATCH v3 02/15] xen/x86: introduce new sub-hypercall to propagate CPPC data
  2025-03-25  4:12     ` Penny, Zheng
@ 2025-03-25  7:53       ` Jan Beulich
  2025-03-28  8:27         ` Penny, Zheng
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2025-03-25  7:53 UTC (permalink / raw)
  To: Penny, Zheng
  Cc: Huang, Ray, Andrew Cooper, Roger Pau Monné, Anthony PERARD,
	Orzel, Michal, Julien Grall, Stefano Stabellini,
	xen-devel@lists.xenproject.org

On 25.03.2025 05:12, Penny, Zheng wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Monday, March 24, 2025 10:28 PM
>>
>> On 06.03.2025 09:39, Penny Zheng wrote:
>>> +    pm_info = processor_pminfo[cpuid];
>>> +    /* Must already allocated in set_psd_pminfo */
>>> +    if ( !pm_info )
>>> +    {
>>> +        ret = -EINVAL;
>>> +        goto out;
>>> +    }
>>> +    pm_info->cppc_data = *cppc_data;
>>> +
>>> +    if ( cpufreq_verbose )
>>> +        print_CPPC(&pm_info->cppc_data);
>>> +
>>> +    pm_info->init = XEN_CPPC_INIT;
>>
>> That is - whichever Dom0 invoked last will have data recorded, and the other
>> effectively is discarded? I think a warning (perhaps a one-time one) is minimally
>> needed to diagnose the case where one type of data replaces the other.
>>
> 
> In the v2 discussion, we discussed that either set_px_pminfo or set_cppc_pminfo shall be invoked,
> which means either PX data is recorded, or CPPC data is recorded.
> The current logic is that the cpufreq cmdline handling sets the XEN_PROCESSOR_PM_PX/CPPC
> flags to reflect user preference. If the user specifies a fallback option, like "cpufreq=amd-cppc,xen", both
> XEN_PROCESSOR_PM_PX | XEN_PROCESSOR_PM_CPPC are set in the beginning.
> Later, in the cpufreq driver registration logic, only one driver can be registered. If amd-cppc
> registers successfully, it clears the XEN_PROCESSOR_PM_PX flag bit.
> If it fails to register, the fallback scheme kicks in: we try legacy P-states and, in the meantime,
> clear XEN_PROCESSOR_PM_CPPC.
> We are trying to make XEN_PROCESSOR_PM_PX and XEN_PROCESSOR_PM_CPPC mutually exclusive
> after driver registration, which ensures that only one of set_px_pminfo or set_cppc_pminfo
> is taken at runtime.

Yet you realize that this requires Dom0 to know what configuration Xen uses,
in order to know which data to upload. The best approach might be to have
Dom0 upload all data it has, with us merely ignoring what we can't make use
of. The order of uploading (CPPC first or CPPC last) shouldn't matter. Then
(and only then, and, for the avoidance of doubt, only when uploading of the
"wrong" kind of data doesn't result in an error) things can go without warning.

Jan
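
One possible reading of "upload everything, ignore what can't be used", as a stand-alone sketch. The constants and the record_upload() helper are hypothetical and do not reflect the actual hypercall handlers; the property illustrated is merely that the unusable kind of data is dropped without an error, so upload order stops mattering.

```c
/* Hypothetical values standing in for the driver and init flags. */
#define PM_PX      (1u << 1)
#define PM_CPPC    (1u << 3)
#define INIT_PX    (1u << 0)
#define INIT_CPPC  (1u << 1)

enum data_kind { DATA_PX, DATA_CPPC };

/*
 * Return the new "init" state after Dom0 uploads one kind of data.
 * Data the active driver cannot use is silently ignored, which in the
 * real handler would correspond to returning 0 rather than an error.
 */
static unsigned int record_upload(unsigned int active, enum data_kind kind,
                                  unsigned int init)
{
    unsigned int need = (kind == DATA_PX) ? PM_PX : PM_CPPC;
    unsigned int flag = (kind == DATA_PX) ? INIT_PX : INIT_CPPC;

    if ( !(active & need) )
        return init;            /* wrong kind for this configuration */

    return init | flag;
}
```

Since each kind only ever sets its own flag, uploading CPPC data first or last produces the same final state.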



* Re: [PATCH v3 09/15] xen/x86: introduce a new amd cppc driver for cpufreq scaling
  2025-03-06  8:39 ` [PATCH v3 09/15] xen/x86: introduce a new amd cppc driver for cpufreq scaling Penny Zheng
@ 2025-03-25  9:57   ` Jan Beulich
  2025-03-25 13:58     ` Jason Andryuk
  2025-04-03  7:40     ` Penny, Zheng
  0 siblings, 2 replies; 60+ messages in thread
From: Jan Beulich @ 2025-03-25  9:57 UTC (permalink / raw)
  To: Penny Zheng
  Cc: ray.huang, Andrew Cooper, Roger Pau Monné, xen-devel,
	Jason Andryuk

On 06.03.2025 09:39, Penny Zheng wrote:
> v2 -> v3:
> - Move all MSR definitions to msr-index.h and follow the required style
> - Refactor opening figure braces for struct/union
> - Sort overlong lines throughout the series
> - Make offset/res int covering underflow scenario
> - Error out when amd_max_freq_mhz isn't set

Given the issue with the patch filling amd_max_freq_mhz I wonder how you
successfully tested this patch here.

> - Introduce amd_get_freq(name) macro to decrease redundancy

Hmm, that's not quite what I was hoping for. I'll comment there in more
detail.

> --- a/xen/arch/x86/acpi/cpufreq/amd-cppc.c
> +++ b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
> @@ -14,7 +14,50 @@
>  #include <xen/domain.h>
>  #include <xen/init.h>
>  #include <xen/param.h>
> +#include <xen/percpu.h>
> +#include <xen/xvmalloc.h>
>  #include <acpi/cpufreq/cpufreq.h>
> +#include <asm/amd.h>
> +#include <asm/msr-index.h>
> +
> +#define amd_cppc_err(cpu, fmt, args...)                     \
> +    printk(XENLOG_ERR "AMD_CPPC: CPU%u error: " fmt, cpu, ## args)
> +#define amd_cppc_warn(fmt, args...)                         \
> +    printk(XENLOG_WARNING "AMD_CPPC: CPU%u warning: " fmt, cpu, ## args)
> +#define amd_cppc_verbose(fmt, args...)                      \
> +({                                                          \
> +    if ( cpufreq_verbose )                                  \
> +        printk(XENLOG_DEBUG "AMD_CPPC: " fmt, ## args);     \
> +})

Why would warning and error come with a CPU number at all times, but not
the verbose construct?

> +struct amd_cppc_drv_data
> +{
> +    const struct xen_processor_cppc *cppc_data;
> +    union {
> +        uint64_t raw;
> +        struct {
> +            unsigned int lowest_perf:8;
> +            unsigned int lowest_nonlinear_perf:8;
> +            unsigned int nominal_perf:8;
> +            unsigned int highest_perf:8;
> +            unsigned int :32;
> +        };
> +    } caps;
> +    union {
> +        uint64_t raw;
> +        struct {
> +            unsigned int max_perf:8;
> +            unsigned int min_perf:8;
> +            unsigned int des_perf:8;
> +            unsigned int epp:8;
> +            unsigned int :32;
> +        };
> +    } req;
> +
> +    int err;
> +};
> +
> +static DEFINE_PER_CPU_READ_MOSTLY(struct amd_cppc_drv_data *, amd_cppc_drv_data);

Nit: Line length. I wonder what "Sort overlong lines throughout the series"
is meant to say in the revision log.

> @@ -51,10 +94,337 @@ int __init amd_cppc_cmdline_parse(const char *s, const char *e)
>      return 0;
>  }
>  
> +/*
> + * If CPPC lowest_freq and nominal_freq registers are exposed then we can
> + * use them to convert perf to freq and vice versa. The conversion is
> + * extrapolated as an linear function passing by the 2 points:
> + *  - (Low perf, Low freq)
> + *  - (Nominal perf, Nominal freq)
> + */
> +static int amd_cppc_khz_to_perf(const struct amd_cppc_drv_data *data,
> +                                unsigned int freq, uint8_t *perf)
> +{
> +    const struct xen_processor_cppc *cppc_data = data->cppc_data;
> +    uint64_t mul, div;
> +    int offset = 0, res;
> +
> +    if ( freq == (cppc_data->nominal_mhz * 1000) )
> +    {
> +        *perf = data->caps.nominal_perf;
> +        return 0;
> +    }
> +
> +    if ( freq == (cppc_data->lowest_mhz * 1000) )
> +    {
> +        *perf = data->caps.lowest_perf;
> +        return 0;
> +    }
> +
> +    if ( cppc_data->lowest_mhz && cppc_data->nominal_mhz )
> +    {
> +        mul = data->caps.nominal_perf - data->caps.lowest_perf;
> +        div = cppc_data->nominal_mhz - cppc_data->lowest_mhz;
> +        /*
> +         * We don't need to convert to KHz for computing offset and can

Nit: kHz (i.e. unlike MHz)

> +         * directly use nominal_mhz and lowest_mhz as the division
> +         * will remove the frequency unit.
> +         */
> +        div = div ?: 1;

Imo the cppc_data->lowest_mhz >= cppc_data->nominal_mhz case had better
not make it here, but use the fallback path below. Or special-case
cppc_data->lowest_mhz == cppc_data->nominal_mhz: mul would
(hopefully) be zero (i.e. there would be the expectation that
data->caps.nominal_perf == data->caps.lowest_perf, yet no guarantee
without checking), and hence ...

> +        offset = data->caps.nominal_perf -
> +                 (mul * cppc_data->nominal_mhz) / div;

... offset = data->caps.nominal_perf regardless of "div" (as long
as that's not zero). I.e. the "equal" case may still be fine to take
this path.

Or is there a check somewhere that lowest_mhz <= nominal_mhz and
lowest_perf <= nominal_perf, which I'm simply overlooking?

> +    }
> +    else
> +    {
> +        /* Read Processor Max Speed(mhz) as anchor point */
> +        mul = data->caps.highest_perf;
> +        div = this_cpu(amd_max_freq_mhz);
> +        if ( !div )
> +            return -EINVAL;
> +    }
> +
> +    res = offset + (mul * freq) / (div * 1000);
> +    if ( res > UINT8_MAX )

I can't quite convince myself that res can't end up negative here, in
which case ...

> +    {
> +        printk_once(XENLOG_WARNING
> +                    "Perf value exceeds maximum value 255: %d\n", res);
> +        *perf = 0xff;
> +        return 0;
> +    }
> +    *perf = (uint8_t)res;

... a bogus value would be stored here.
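
To make the concern concrete, a small stand-alone sketch of a two-sided check before the narrowing store. Clamping into the [lowest, highest] capability range is an assumption made for illustration only (returning an error for a negative result would be another option), and clamp_perf() is a hypothetical helper, not part of the patch.

```c
#include <stdint.h>

/*
 * Narrow a signed intermediate result to a perf value, guarding both
 * ends: a negative "res" must not wrap to a large uint8_t.
 */
static uint8_t clamp_perf(int res, uint8_t lowest, uint8_t highest)
{
    if ( res < (int)lowest )
        return lowest;          /* also covers res < 0 */

    if ( res > (int)highest )
        return highest;         /* also covers res > UINT8_MAX */

    return (uint8_t)res;
}
```

With only the upper bound checked, as in the quoted hunk, a result of -5 would be stored as 251.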

> +    return 0;
> +}
> +
> +#define amd_get_freq(name)                                                  \

The macro parameter is used just ...

> +    static int amd_get_##name##_freq(const struct amd_cppc_drv_data *data,  \

... here, ...

> +                                     unsigned int *freq)                    \
> +    {                                                                       \
> +        const struct xen_processor_cppc *cppc_data = data->cppc_data;       \
> +        uint64_t mul, div, res;                                             \
> +                                                                            \
> +        if ( cppc_data->name##_mhz )                                        \
> +        {                                                                   \
> +            /* Switch to khz */                                             \
> +            *freq = cppc_data->name##_mhz * 1000;                           \

... twice here for the MHz value, and ...

> +            return 0;                                                       \
> +        }                                                                   \
> +                                                                            \
> +        /* Read Processor Max Speed(mhz) as anchor point */                 \
> +        mul = this_cpu(amd_max_freq_mhz);                                   \
> +        if ( !mul )                                                         \
> +            return -EINVAL;                                                 \
> +        div = data->caps.highest_perf;                                      \
> +        res = (mul * data->caps.name##_perf * 1000) / div;                  \

... here for the respective perf indicator. Why does it take ...

> +        if ( res > UINT_MAX )                                               \
> +        {                                                                   \
> +            printk(XENLOG_ERR                                               \
> +                   "Frequeny exceeds maximum value UINT_MAX: %lu\n", res);  \
> +            return -EINVAL;                                                 \
> +        }                                                                   \
> +        *freq = (unsigned int)res;                                          \
> +                                                                            \
> +        return 0;                                                           \
> +    }                                                                       \
> +
> +amd_get_freq(lowest);
> +amd_get_freq(nominal);

... two almost identical functions, when one (with two extra input parameters)
would suffice?

In amd_cppc_khz_to_perf() you have a check to avoid division by zero. Why
not the same safeguarding here?
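
A stand-alone sketch of what a single helper with two extra inputs might look like, including the zero-divisor guard. Names and signature are hypothetical; because no driver structures are available here, highest_perf and the max frequency are passed as plain parameters too, and the reported-MHz path is widened to 64 bits before the range check.

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/*
 * Convert either a firmware-reported frequency (mhz, if non-zero) or a
 * perf value (scaled against highest_perf / max_freq_mhz) to kHz.
 */
static int perf_to_khz(unsigned int mhz, uint8_t perf, uint8_t highest_perf,
                       uint64_t max_freq_mhz, unsigned int *khz)
{
    uint64_t res;

    if ( mhz )
        res = (uint64_t)mhz * 1000;
    else
    {
        /* Guard both the anchor frequency and the divisor. */
        if ( !max_freq_mhz || !highest_perf )
            return -EINVAL;

        res = (max_freq_mhz * perf * 1000) / highest_perf;
    }

    if ( res > UINT_MAX )
        return -EINVAL;

    *khz = (unsigned int)res;

    return 0;
}
```

The "lowest" and "nominal" variants would then differ only in the mhz and perf arguments passed by the caller.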

> +static int amd_get_max_freq(const struct amd_cppc_drv_data *data,
> +                            unsigned int *max_freq)
> +{
> +    unsigned int nom_freq, boost_ratio;
> +    int res;
> +
> +    res = amd_get_nominal_freq(data, &nom_freq);
> +    if ( res )
> +        return res;
> +
> +    boost_ratio = (unsigned int)(data->caps.highest_perf /
> +                                 data->caps.nominal_perf);

Similarly here - I can't spot what would prevent division by zero.

> +    *max_freq = nom_freq * boost_ratio;

Nor is it clear to me why (with bogus MSR contents) boost_ratio couldn't
end up being zero, and hence we'd report back ...

> +    return 0;

... success with a frequency of 0.
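
As a sketch of what a guarded calculation could look like (hypothetical
function name, not from the patch): the version below rejects a zero
nominal_perf and a zero result. It also multiplies before dividing, which
goes beyond the two points raised above and is only an assumption about what
might be wanted, since an integer highest / nominal ratio truncates.

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/* Illustrative only: derive the maximum frequency from the nominal one. */
static int max_freq_from_nominal(unsigned int nom_freq, uint8_t highest_perf,
                                 uint8_t nominal_perf, unsigned int *max_freq)
{
    uint64_t res;

    /* Avoid dividing by zero when the capability value is bogus. */
    if ( !nominal_perf )
        return -EINVAL;

    /* Multiply first, so that e.g. 166 / 100 doesn't collapse to 1. */
    res = (uint64_t)nom_freq * highest_perf / nominal_perf;

    /* Never report success together with a frequency of 0. */
    if ( !res || res > UINT_MAX )
        return -EINVAL;

    *max_freq = (unsigned int)res;

    return 0;
}
```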

> +}
> +
> +static int cf_check amd_cppc_cpufreq_verify(struct cpufreq_policy *policy)
> +{
> +    cpufreq_verify_within_limits(policy, policy->cpuinfo.min_freq,
> +                                 policy->cpuinfo.max_freq);
> +
> +    return 0;
> +}
> +
> +static void amd_cppc_write_request_msrs(void *info)
> +{
> +    struct amd_cppc_drv_data *data = info;
> +
> +    if ( wrmsr_safe(MSR_AMD_CPPC_REQ, data->req.raw) )
> +    {
> +        data->err = -EINVAL;
> +        return;
> +    }
> +}
> +
> +static int cf_check amd_cppc_write_request(unsigned int cpu, uint8_t min_perf,
> +                                           uint8_t des_perf, uint8_t max_perf)

The cf_check looks to be misplaced here, and rather wants to go to
amd_cppc_write_request_msrs() because of ...

> +{
> +    struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data, cpu);
> +    uint64_t prev = data->req.raw;
> +
> +    data->req.min_perf = min_perf;
> +    data->req.max_perf = max_perf;
> +    data->req.des_perf = des_perf;
> +
> +    if ( prev == data->req.raw )
> +        return 0;
> +
> +    data->err = 0;
> +    on_selected_cpus(cpumask_of(cpu), amd_cppc_write_request_msrs, data, 1);

... this use of a function pointer here.

> +    return data->err;
> +}
> +
> +static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
> +                                            unsigned int target_freq,
> +                                            unsigned int relation)
> +{
> +    unsigned int cpu = policy->cpu;
> +    const struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data, cpu);
> +    uint8_t des_perf;
> +    int res;
> +
> +    if ( unlikely(!target_freq) )
> +        return 0;

Checking other *_cpufreq_target() functions, none would silently ignore
a zero input. (HWP's ignores the input altogether though; Cc-ing Jason
for possible clarification: I would have expected this driver here and
the HWP one to be similar in this regard.)

> +    res = amd_cppc_khz_to_perf(data, target_freq, &des_perf);
> +    if ( res )
> +        return res;
> +
> +    return amd_cppc_write_request(policy->cpu, data->caps.lowest_nonlinear_perf,
> +                                  des_perf, data->caps.highest_perf);

As before: the use of the "non-linear" value here wants to come with a
(perhaps brief) comment.

> +}
> +
> +static void cf_check amd_cppc_init_msrs(void *info)
> +{
> +    struct cpufreq_policy *policy = info;
> +    struct amd_cppc_drv_data *data = this_cpu(amd_cppc_drv_data);
> +    uint64_t val;
> +    unsigned int min_freq, nominal_freq, max_freq;
> +
> +    /* Package level MSR */
> +    if ( rdmsr_safe(MSR_AMD_CPPC_ENABLE, val) )
> +    {
> +        amd_cppc_err(policy->cpu, "rdmsr_safe(MSR_AMD_CPPC_ENABLE)\n");
> +        goto err;
> +    }
> +
> +    /*
> +     * Only when Enable bit is on, the hardware will calculate the processor’s
> +     * performance capabilities and initialize the performance level fields in
> +     * the CPPC capability registers.
> +     */
> +    if ( !(val & AMD_CPPC_ENABLE) )
> +    {
> +        val |= AMD_CPPC_ENABLE;
> +        if ( wrmsr_safe(MSR_AMD_CPPC_ENABLE, val) )
> +        {
> +            amd_cppc_err(policy->cpu,
> +                         "wrmsr_safe(MSR_AMD_CPPC_ENABLE, %lx)\n", val);
> +            goto err;
> +        }
> +    }
> +
> +    if ( rdmsr_safe(MSR_AMD_CPPC_CAP1, data->caps.raw) )
> +    {
> +        amd_cppc_err(policy->cpu, "rdmsr_safe(MSR_AMD_CPPC_CAP1)\n");
> +        goto err;
> +    }
> +
> +    if ( data->caps.highest_perf == 0 || data->caps.lowest_perf == 0 ||
> +         data->caps.nominal_perf == 0 || data->caps.lowest_nonlinear_perf == 0 )
> +    {
> +        amd_cppc_err(policy->cpu,
> +                     "Platform malfunction, read CPPC highest_perf: %u, lowest_perf: %u, nominal_perf: %u, lowest_nonlinear_perf: %u zero value\n",

I don't think the _perf suffixes are overly relevant in the log message.

> +                     data->caps.highest_perf, data->caps.lowest_perf,
> +                     data->caps.nominal_perf, data->caps.lowest_nonlinear_perf);
> +        goto err;
> +    }
> +
> +    data->err = amd_get_lowest_freq(data, &min_freq);
> +    if ( data->err )
> +        return;
> +
> +    data->err = amd_get_nominal_freq(data, &nominal_freq);
> +    if ( data->err )
> +        return;
> +
> +    data->err = amd_get_max_freq(data, &max_freq);
> +    if ( data->err )
> +        return;
> +
> +    if ( min_freq > max_freq || nominal_freq > max_freq ||
> +         nominal_freq < min_freq )
> +    {
> +        amd_cppc_err(policy->cpu,
> +                     "min_freq(%u), or max_freq(%u), or nominal_freq(%u) value is incorrect\n",

Along the lines of the above, while it wants making clear here it's frequencies,
I question the use of identifier names to express that. E.g.
"min (%u), or max (%u), or nominal (%u) freq value is incorrect\n"?

> +                     min_freq, max_freq, nominal_freq);
> +        goto err;
> +    }
> +
> +    policy->min = min_freq;
> +    policy->max = max_freq;
> +
> +    policy->cpuinfo.min_freq = min_freq;
> +    policy->cpuinfo.max_freq = max_freq;
> +    policy->cpuinfo.perf_freq = nominal_freq;
> +    /*
> +     * Set after policy->cpuinfo.perf_freq, as we are taking
> +     * APERF/MPERF average frequency as current frequency.
> +     */
> +    policy->cur = cpufreq_driver_getavg(policy->cpu, GOV_GETAVG);
> +
> +    return;
> +
> + err:
> +    data->err = -EINVAL;

If we make it here after having set the enable bit, we're hosed (afaict).
We can't fall back to another driver, and we also can't get this driver to
work. I think I did ask before that this be explained in a comment here.
(The only thing the user can do is, aiui, to change the command line option
and reboot.)

Oh, I see you have such a comment at the use site of this function. Please
have a brief comment here then, to refer there.

> +}
> +
> +/*
> + * The new AMD CPPC driver is different than legacy ACPI hardware P-State,

Please omit "new" - that'll be stale rather sooner than later.

> + * which has a finer grain frequency range between the highest and lowest
> + * frequency. And boost frequency is actually the frequency which is mapped on
> + * highest performance ratio. The legacy P0 frequency is actually mapped on
> + * nominal performance ratio.
> + */
> +static void amd_cppc_boost_init(struct cpufreq_policy *policy,
> +                                const struct amd_cppc_drv_data *data)
> +{
> +    if ( data->caps.highest_perf <= data->caps.nominal_perf )
> +        return;
> +
> +    policy->turbo = CPUFREQ_TURBO_ENABLED;
> +}
> +
> +static int cf_check amd_cppc_cpufreq_cpu_exit(struct cpufreq_policy *policy)
> +{
> +    XVFREE(per_cpu(amd_cppc_drv_data, policy->cpu));
> +
> +    return 0;
> +}
> +
> +static int cf_check amd_cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
> +{
> +    unsigned int cpu = policy->cpu;
> +    struct amd_cppc_drv_data *data;
> +    const struct cpuinfo_x86 *c = cpu_data + cpu;
> +
> +    data = xvzalloc(struct amd_cppc_drv_data);
> +    if ( !data )
> +        return -ENOMEM;
> +
> +    data->cppc_data = &processor_pminfo[cpu]->cppc_data;
> +
> +    per_cpu(amd_cppc_drv_data, cpu) = data;
> +
> +    /* Feature CPPC is firstly introduced on Zen2 */
> +    if ( c->x86 < 0x17 )
> +    {
> +        printk_once("Unsupported cpu family: %x\n", c->x86);
> +        return -EOPNOTSUPP;
> +    }
> +
> +    on_selected_cpus(cpumask_of(cpu), amd_cppc_init_msrs, policy, 1);
> +
> +    /*
> +     * If error path takes effective, not only amd-cppc cpufreq driver fails
> +     * to initialize, but also we could not fall back to legacy P-states
> +     * driver nevertheless we specifies fall back option in cmdline.
> +     */

Nit: I'm not a native speaker, but I don't think "nevertheless" can be used here.
Maybe "... but we also cannot fall back to the legacy driver, irrespective of
the command line specifying a fallback option"?

Plus I think it would help to also explain why here, i.e. that the enable bit is
sticky.

> +    if ( data->err )
> +    {
> +        amd_cppc_err(cpu, "Could not initialize AMD CPPC MSR properly\n");
> +        amd_cppc_cpufreq_cpu_exit(policy);
> +        return -ENODEV;

Why do you not use data->err here?

> --- a/xen/arch/x86/include/asm/msr-index.h
> +++ b/xen/arch/x86/include/asm/msr-index.h
> @@ -238,6 +238,11 @@
>  
>  #define MSR_AMD_CSTATE_CFG                  0xc0010296U
>  
> +#define MSR_AMD_CPPC_CAP1                   0xc00102b0
> +#define MSR_AMD_CPPC_ENABLE                 0xc00102b1
> +#define  AMD_CPPC_ENABLE                    (_AC(1, ULL) <<  0)
> +#define MSR_AMD_CPPC_REQ                    0xc00102b3

As you can see from the pre-existing #define in context, there are U suffixes
missing here.

Jan



* Re: [PATCH v3 05/15] xen/x86: introduce "cpufreq=amd-cppc" xen cmdline
  2025-03-06  8:39 ` [PATCH v3 05/15] xen/x86: introduce "cpufreq=amd-cppc" xen cmdline Penny Zheng
  2025-03-24 15:26   ` Jan Beulich
@ 2025-03-25 10:00   ` Jan Beulich
  1 sibling, 0 replies; 60+ messages in thread
From: Jan Beulich @ 2025-03-25 10:00 UTC (permalink / raw)
  To: Penny Zheng
  Cc: ray.huang, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Julien Grall, Roger Pau Monné, Stefano Stabellini, xen-devel

On 06.03.2025 09:39, Penny Zheng wrote:
> +static const struct cpufreq_driver __initconst_cf_clobber
> +amd_cppc_cpufreq_driver =
> +{
> +    .name   = XEN_AMD_CPPC_DRIVER_NAME,
> +};

Because of the hook pointers not being set right here, ...

> +int __init amd_cppc_register_driver(void)
> +{
> +    int ret;
> +
> +    if ( !cpu_has_cppc )
> +    {
> +        xen_processor_pmbits &= ~XEN_PROCESSOR_PM_CPPC;
> +        return -ENODEV;
> +    }
> +
> +    ret = cpufreq_register_driver(&amd_cppc_cpufreq_driver);
> +    if ( ret )
> +        return ret;

... this - afaict - will fail up until patch 09. This may want mentioning
in the description here. (Initially I thought you'd leave NULL derefs around
for several patches, until I checked cpufreq_register_driver().)

Jan



* Re: [PATCH v3 11/15] xen/cpufreq: abstract Energy Performance Preference value
  2025-03-06  8:39 ` [PATCH v3 11/15] xen/cpufreq: abstract Energy Performance Preference value Penny Zheng
@ 2025-03-25 10:13   ` Jan Beulich
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Beulich @ 2025-03-25 10:13 UTC (permalink / raw)
  To: Penny Zheng; +Cc: ray.huang, Andrew Cooper, Roger Pau Monné, xen-devel

On 06.03.2025 09:39, Penny Zheng wrote:
> Intel's hwp Energy Performance Preference value is compatible with
> CPPC's Energy Performance Preference value, so this commit abstracts
> the value and re-place it in common header file cpufreq.h, to be
> used not only for hwp in the future.
> 
> Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
> Acked-by: Jan Beulich <jbeulich@suse.com>

Hmm, this had gone in already before you sent v3. Why was it nevertheless
included here?

Jan



* Re: [PATCH v3 12/15] xen/x86: implement EPP support for the amd-cppc driver in active mode
  2025-03-06  8:39 ` [PATCH v3 12/15] xen/x86: implement EPP support for the amd-cppc driver in active mode Penny Zheng
@ 2025-03-25 10:48   ` Jan Beulich
  2025-03-28  4:07     ` Penny, Zheng
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2025-03-25 10:48 UTC (permalink / raw)
  To: Penny Zheng
  Cc: ray.huang, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Julien Grall, Roger Pau Monné, Stefano Stabellini, xen-devel

On 06.03.2025 09:39, Penny Zheng wrote:
> amd-cppc has 2 operation modes: autonomous (active) mode,
> non-autonomous (passive) mode.
> In active mode, platform ignores the requestd done in the Desired
> Performance Target register and takes into account only the values
> set to the minimum, maximum and energy performance preference(EPP)
> registers.
> The EPP is used in the CCLK DPM controller to drive the frequency
> that a core is going to operate during short periods of activity.
> The SOC EPP targets are configured on a scale from 0 to 255 where 0
> represents maximum performance and 255 represents maximum efficiency.

So this is the other way around from "perf" values, where aiui 0xff is
"highest"?

> @@ -537,6 +537,12 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
>  * `amd-cppc` selects ACPI Collaborative Performance and Power Control (CPPC)
>    on supported AMD hardware to provide finer grained frequency control
>    mechanism. The default is disabled.
> +* `active` is to enable amd-cppc driver in active(autonomous) mode. In this
> +  mode, users could write to energy performance preference register to tell
> +  hardware if they want to bias toward performance or energy efficiency. Then
> +  built-in CPPC power algorithm will calculate the runtime workload and adjust
> +  the realtime cores frequency automatically according to the power supply and

What are "the realtime cores"?

> +  thermal, core voltage and some other hardware conditions.

I think it would be better to have only one "and" in the enumeration of conditions.

> @@ -261,7 +276,20 @@ static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
>          return res;
>  
>      return amd_cppc_write_request(policy->cpu, data->caps.lowest_nonlinear_perf,
> -                                  des_perf, data->caps.highest_perf);
> +                                  des_perf, data->caps.highest_perf,
> +                                  /* Pre-defined BIOS value for passive mode */
> +                                  per_cpu(epp_init, policy->cpu));
> +}
> +
> +static int read_epp_init(void)
> +{
> +    uint64_t val;
> +
> +    if ( rdmsr_safe(MSR_AMD_CPPC_REQ, val) )
> +        return -EINVAL;

I'm unconvinced of using rdmsr_safe() everywhere (i.e. this also goes for earlier
patches). Unless you can give a halfway reasonable scenario under which by the
time we get here there's still a chance that the MSR isn't implemented in the
next lower layer (hardware or another hypervisor, just to explain what's meant,
without me assuming that the driver should come into play in the first place when
we run virtualized ourselves).

Furthermore you call this function unconditionally, i.e. if there was a chance
for the MSR read to fail, CPU init would needlessly fail when in passive mode.

> +    this_cpu(epp_init) = (val >> 24) & 0xFF;

Please can you #define a suitable mask constant in msr-index.h, such that you can
use MASK_EXTR() here?
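
A minimal sketch of the idea follows. The mask name is hypothetical, and the
bit position is merely inferred from the "(val >> 24) & 0xFF" expression
quoted above. MASK_EXTR() is reproduced here in the shape Xen defines it, so
that the snippet stands alone.

```c
#include <stdint.h>

/* Same shape as Xen's MASK_EXTR(): isolate the field, then shift it down. */
#define MASK_EXTR(v, m) (((v) & (m)) / ((m) & -(m)))

/* Hypothetical name; bits 31:24, as implied by the open-coded shift. */
#define AMD_CPPC_EPP_MASK (UINT64_C(0xff) << 24)

/* Extract the EPP field from a raw request register value. */
static uint8_t cppc_req_to_epp(uint64_t val)
{
    return MASK_EXTR(val, AMD_CPPC_EPP_MASK);
}
```

With the mask named in one place, the shift count and field width no longer
need to be kept in sync by hand at each use site.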

> @@ -411,12 +441,78 @@ static int cf_check amd_cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
>  
>      amd_cppc_boost_init(policy, data);
>  
> +    return 0;
> +}
> +
> +static int cf_check amd_cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
> +{
> +    int ret;
> +
> +    ret = amd_cppc_cpufreq_init_perf(policy);
> +    if ( ret )
> +        return ret;
> +
>      amd_cppc_verbose("CPU %u initialized with amd-cppc passive mode\n",
>                       policy->cpu);
>  
>      return 0;
>  }
>  
> +static int cf_check amd_cppc_epp_cpu_init(struct cpufreq_policy *policy)
> +{
> +    int ret;
> +
> +    ret = amd_cppc_cpufreq_init_perf(policy);
> +    if ( ret )
> +        return ret;
> +
> +    policy->policy = cpufreq_parse_policy(policy->governor);
> +
> +    amd_cppc_verbose("CPU %u initialized with amd-cppc active mode\n", policy->cpu);
> +
> +    return 0;
> +}
> +
> +static int amd_cppc_epp_update_limit(const struct cpufreq_policy *policy)
> +{
> +    const struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data,
> +                                                    policy->cpu);

Nit: Indentation is off by one here.

> +    uint8_t max_perf, min_perf, epp;
> +
> +    /* Initial min/max values for CPPC Performance Controls Register */
> +    /*
> +     * Continuous CPPC performance scale in active mode is [lowest_perf,
> +     * highest_perf]
> +     */
> +    max_perf = data->caps.highest_perf;
> +    min_perf = data->caps.lowest_perf;
> +
> +    epp = per_cpu(epp_init, policy->cpu);
> +    if ( policy->policy == CPUFREQ_POLICY_PERFORMANCE )

This may want to be switch() instead.
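
Roughly along these lines, as a standalone sketch. The enum, the struct and
the function name are stand-ins invented for the example; only the 0x00 and
0xff EPP extremes come from the patch description.

```c
#include <stdint.h>

/* Stand-ins for the CPUFREQ_POLICY_* constants; values are made up. */
enum { POLICY_UNKNOWN, POLICY_POWERSAVE, POLICY_PERFORMANCE };

#define EPP_MAX_PERFORMANCE 0x00
#define EPP_MAX_POWERSAVE   0xff

struct limits {
    uint8_t min_perf, max_perf, epp;
};

static struct limits pick_limits(unsigned int policy, uint8_t lowest_perf,
                                 uint8_t highest_perf, uint8_t epp_init)
{
    /* Defaults: full perf range, firmware-provided EPP. */
    struct limits l = {
        .min_perf = lowest_perf,
        .max_perf = highest_perf,
        .epp = epp_init,
    };

    switch ( policy )
    {
    case POLICY_PERFORMANCE:
        /* Bias fully toward performance and pin the range to the top. */
        l.epp = EPP_MAX_PERFORMANCE;
        l.min_perf = l.max_perf;
        break;

    case POLICY_POWERSAVE:
        /* Bias fully toward efficiency, keeping the full range. */
        l.epp = EPP_MAX_POWERSAVE;
        break;

    default:
        break;
    }

    return l;
}
```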

> +    {
> +        /* Force the epp value to be zero for performance policy */
> +        epp = CPPC_ENERGY_PERF_MAX_PERFORMANCE;
> +        min_perf = max_perf;
> +    }
> +    else if ( policy->policy == CPUFREQ_POLICY_POWERSAVE )
> +        /* Force the epp value to be 0xff for powersave policy */
> +        /*
> +         * If set max_perf = min_perf = lowest_perf, we are putting
> +         * cpu cores in idle.
> +         */

Nit: Such two successive comments want combining. (Same near the top of the
function, as I notice only now.)

Furthermore I'm having trouble interpreting this comment: To me "lowest"
doesn't mean "doing nothing" but "doing things as efficiently in terms of
power use as possible". IOW that's not idle. Yet the comment reads as if it
was meant to be an explanation of why we can't set max_perf from min_perf
here. That is, no matter what's meant to be said, I think this needs
re-wording (and possibly using subjunctive mood).

> +        epp = CPPC_ENERGY_PERF_MAX_POWERSAVE;
> +
> +    return amd_cppc_write_request(policy->cpu, min_perf,
> +                                  /* des_perf = 0 for epp mode */
> +                                  0,

The comment could do with putting on the same line as the 0, e.g.
(slightly adjusted)

    return amd_cppc_write_request(policy->cpu, min_perf,
                                  0 /* no des_perf for epp mode */,
                                  max_perf, epp);

> +static int cf_check amd_cppc_epp_set_policy(struct cpufreq_policy *policy)
> +{
> +    return amd_cppc_epp_update_limit(policy);
> +}

So the purpose of this wrapper is solely to have the actual function's
parameter be pointer-to-const? I don't think that's worth it; I also don't
think we do such elsewhere.

> --- a/xen/drivers/cpufreq/utility.c
> +++ b/xen/drivers/cpufreq/utility.c
> @@ -491,3 +491,14 @@ int __cpufreq_set_policy(struct cpufreq_policy *data,
>  
>      return __cpufreq_governor(data, CPUFREQ_GOV_LIMITS);
>  }
> +
> +unsigned int cpufreq_parse_policy(const struct cpufreq_governor *gov)
> +{
> +    if ( !strncasecmp(gov->name, "performance", CPUFREQ_NAME_LEN) )
> +        return CPUFREQ_POLICY_PERFORMANCE;
> +
> +    if ( !strncasecmp(gov->name, "powersave", CPUFREQ_NAME_LEN) )
> +        return CPUFREQ_POLICY_POWERSAVE;
> +
> +    return CPUFREQ_POLICY_UNKNOWN;
> +}

Hmm, this isn't really parsing (in the sense of dealing with e.g. command
line elements). Maybe cpufreq_get_policy() or, more explicitly,
cpufreq_policy_from_governor()? Or something along these lines?

I also don't see why the more expensive case-insensitive comparison
routine needs using here.
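
A sketch of what a plain, case-sensitive variant could look like. The enum
values and the function name are placeholders for this example; it assumes
governor names are always registered in lower case, which would want
double-checking.

```c
#include <string.h>

/* Stand-in values; the real CPUFREQ_POLICY_* constants live in Xen. */
enum { POLICY_UNKNOWN, POLICY_POWERSAVE, POLICY_PERFORMANCE };

/* Hypothetical name, along the lines of the renaming suggested above. */
static unsigned int policy_from_governor_name(const char *name)
{
    /* Exact match only; no case folding is done. */
    if ( !strcmp(name, "performance") )
        return POLICY_PERFORMANCE;

    if ( !strcmp(name, "powersave") )
        return POLICY_POWERSAVE;

    return POLICY_UNKNOWN;
}
```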

Jan



* Re: [PATCH v3 07/15] xen/cpufreq: fix core frequency calculation for AMD Family 1Ah CPUs
  2025-03-24 15:47   ` Jan Beulich
@ 2025-03-25 10:55     ` Nicola Vetrini
  2025-03-26  9:54     ` Penny, Zheng
  1 sibling, 0 replies; 60+ messages in thread
From: Nicola Vetrini @ 2025-03-25 10:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Penny Zheng, ray.huang, Andrew Cooper, Roger Pau Monné,
	xen-devel

On 2025-03-24 16:47, Jan Beulich wrote:
> On 06.03.2025 09:39, Penny Zheng wrote:
>> This commit fixes core frequency calculation for AMD Family 1Ah CPUs, 
>> due to
>> a change in the PStateDef MSR layout in AMD Family 1Ah+.
>> In AMD Family 1Ah+, Core current operating frequency in MHz is 
>> calculated as
>> follows:
> 

[...]

> 
>> @@ -658,19 +670,20 @@ void amd_log_freq(const struct cpuinfo_x86 *c)
>>  	if (!(lo >> 63))
>>  		return;
>> 
>> -#define FREQ(v) (c->x86 < 0x17 ? ((((v) & 0x3f) + 0x10) * 100) >> 
>> (((v) >> 6) & 7) \
>> -		                     : (((v) & 0xff) * 25 * 8) / (((v) >> 8) & 
>> 0x3f))
>>  	if (idx && idx < h &&
>>  	    !rdmsr_safe(0xC0010064 + idx, val) && (val >> 63) &&
>>  	    !rdmsr_safe(0xC0010064, hi) && (hi >> 63))
>>  		printk("CPU%u: %lu (%lu ... %lu) MHz\n",
>> -		       smp_processor_id(), FREQ(val), FREQ(lo), FREQ(hi));
>> +		       smp_processor_id(),
>> +		       amd_parse_freq(c, val),
>> +		       amd_parse_freq(c, lo), amd_parse_freq(c, hi));
> 
> I fear Misra won't like multiple function calls to evaluate the 
> parameters
> to pass to another function. Iirc smp_process_id() has special 
> exception,
> so that's okay here. This may be possible to alleviate by marking the 
> new
> helper pure or even const (see gcc doc as to caveats with passing 
> pointers
> to const functions). Cc-ing Nicola for possible clarification or 
> correction.
> 
> Jan

Yes, it would help. Currently there is only a property for 
smp_processor_id(), though there has been some discussion in the past 
about adding a formal deviation. Not a big problem either way since 
currently the rule is non-blocking, but definitely an attribute would 
help any future work on making that clean.
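
To make the attribute idea concrete, here is a small sketch using the GCC
"const" function attribute. The function name is hypothetical, and the field
layout is the one from the removed FREQ() macro for family 0x17 and later,
not the new Family 1Ah layout. Because this variant takes only a value and
reads no memory, "const" is applicable; a helper that dereferences a pointer
argument would more likely want "pure" instead.

```c
/*
 * Illustrative only: the result depends solely on the argument, so the
 * compiler (and a static analyser) may treat calls as side-effect free.
 */
static __attribute__((const)) unsigned long parse_freq_fam17(unsigned long long v)
{
    unsigned int div = (v >> 8) & 0x3f;

    /* The zero-divisor guard is an addition made for this example. */
    return div ? ((v & 0xff) * 25 * 8) / div : 0;
}
```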

-- 
Nicola Vetrini, B.Sc.
Software Engineer
BUGSENG (https://bugseng.com)
LinkedIn: https://www.linkedin.com/in/nicola-vetrini-a42471253



* Re: [PATCH v3 14/15] xen/xenpm: Adapt cpu frequency monitor in xenpm
  2025-03-06  8:39 ` [PATCH v3 14/15] xen/xenpm: Adapt cpu frequency monitor in xenpm Penny Zheng
@ 2025-03-25 11:26   ` Jan Beulich
  2025-03-25 16:37     ` Jason Andryuk
  2025-03-26 15:45     ` Anthony PERARD
  0 siblings, 2 replies; 60+ messages in thread
From: Jan Beulich @ 2025-03-25 11:26 UTC (permalink / raw)
  To: Penny Zheng, Jason Andryuk, Anthony PERARD
  Cc: ray.huang, Juergen Gross, xen-devel

On 06.03.2025 09:39, Penny Zheng wrote:
> Make `xenpm get-cpureq-para/set-cpufreq-para` available in CPPC mode.
> --- a/tools/libs/ctrl/xc_pm.c
> +++ b/tools/libs/ctrl/xc_pm.c
> @@ -214,13 +214,12 @@ int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
>  			 user_para->gov_num * CPUFREQ_NAME_LEN * sizeof(char), XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
>  
>      bool has_num = user_para->cpu_num &&
> -                     user_para->freq_num &&
>                       user_para->gov_num;
>  
>      if ( has_num )

Something looks wrong here already before your patch: With how has_num is set
and with this conditional, ...

>      {
>          if ( (!user_para->affected_cpus)                    ||
> -             (!user_para->scaling_available_frequencies)    ||
> +             (user_para->freq_num && !user_para->scaling_available_frequencies)    ||
>               (user_para->gov_num && !user_para->scaling_available_governors) )

... this ->gov_num check, ...

>          {
>              errno = EINVAL;
> @@ -228,14 +227,16 @@ int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
>          }
>          if ( xc_hypercall_bounce_pre(xch, affected_cpus) )
>              goto unlock_1;
> -        if ( xc_hypercall_bounce_pre(xch, scaling_available_frequencies) )
> +        if ( user_para->freq_num &&
> +             xc_hypercall_bounce_pre(xch, scaling_available_frequencies) )
>              goto unlock_2;
>          if ( user_para->gov_num &&

... this one, and ...

>               xc_hypercall_bounce_pre(xch, scaling_available_governors) )
>              goto unlock_3;
>  
>          set_xen_guest_handle(sys_para->affected_cpus, affected_cpus);
> -        set_xen_guest_handle(sys_para->scaling_available_frequencies, scaling_available_frequencies);
> +        if ( user_para->freq_num )
> +            set_xen_guest_handle(sys_para->scaling_available_frequencies, scaling_available_frequencies);

(Nit: Yet another overly long line. It was too long already before, yes, but
 that's no excuse to make it even longer. All the more so as there is better
 formatting right in context below.)

>          if ( user_para->gov_num )

... this one are all dead code. Jason? I expect the has_num variable simply
wants dropping altogether, thus correcting the earlier anomaly and getting
the intended new behavior at the same time.

>              set_xen_guest_handle(sys_para->scaling_available_governors,
>                                   scaling_available_governors);

This is the piece of context I'm referring to in the nit above.

> @@ -301,7 +302,8 @@ unlock_4:
>      if ( user_para->gov_num )
>          xc_hypercall_bounce_post(xch, scaling_available_governors);
>  unlock_3:
> -    xc_hypercall_bounce_post(xch, scaling_available_frequencies);
> +    if ( user_para->freq_num )
> +        xc_hypercall_bounce_post(xch, scaling_available_frequencies);
>  unlock_2:
>      xc_hypercall_bounce_post(xch, affected_cpus);
>  unlock_1:

I'm also puzzled by the function's inconsistent return value - Anthony,
can you explain / spot why things are the way they are?

> --- a/tools/misc/xenpm.c
> +++ b/tools/misc/xenpm.c
> @@ -539,7 +539,7 @@ static void signal_int_handler(int signo)
>                          res / 1000000UL, 100UL * res / (double)sum_px[i]);
>              }
>          }
> -        if ( px_cap && avgfreq[i] )
> +        if ( avgfreq[i] )
>              printf("  Avg freq\t%d\tKHz\n", avgfreq[i]);
>      }

I wonder whether this shouldn't be an independent change (which then
could go in rather sooner).

> @@ -926,7 +926,8 @@ static int show_cpufreq_para_by_cpuid(xc_interface *xc_handle, int cpuid)
>              ret = -ENOMEM;
>              goto out;
>          }
> -        if (!(p_cpufreq->scaling_available_frequencies =
> +        if (p_cpufreq->freq_num &&
> +            !(p_cpufreq->scaling_available_frequencies =
>                malloc(p_cpufreq->freq_num * sizeof(uint32_t))))
>          {
>              fprintf(stderr,

Can someone explain to me how the pre-existing logic here works? All
three ->*_num start out as zero. Hence respective allocations (of zero
size) may conceivably return NULL (the behavior there is implementation
defined after all). Yet then we'd bail from the loop, and hence from the
function. IOW adding a ->freq_num check and also a ->cpu_num one (along
with the ->gov_num one that apparently was added during HWP development)
would once again look like an independent (latent) bugfix to me.
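
To illustrate the concern, a standalone sketch of an allocation helper that
does not treat a zero count as a failure. The helper name and its return
convention are invented for the example and are not part of xenpm.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Illustrative helper: allocate an array of "num" uint32_t elements.
 * Returns 0 on success and -1 on allocation failure.
 */
static int alloc_u32_array(uint32_t **out, unsigned int num)
{
    *out = NULL;

    /* malloc(0) may legitimately return NULL, so skip the call entirely. */
    if ( !num )
        return 0;

    *out = malloc(num * sizeof(**out));

    return *out ? 0 : -1;
}
```

With such a guard, the first pass (where all counts are still zero) cannot
bail out merely because the C library chose to return NULL for a zero-size
request.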

> --- a/xen/drivers/acpi/pmstat.c
> +++ b/xen/drivers/acpi/pmstat.c
> @@ -202,7 +202,7 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
>      pmpt = processor_pminfo[op->cpuid];
>      policy = per_cpu(cpufreq_cpu_policy, op->cpuid);
>  
> -    if ( !pmpt || !pmpt->perf.states ||
> +    if ( !pmpt || ((pmpt->init & XEN_PX_INIT) && !pmpt->perf.states) ||
>           !policy || !policy->governor )
>          return -EINVAL;

Wouldn't this change better belong in the earlier patch, where the code
in context of the last hunk below was adjusted?

> @@ -229,17 +229,20 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
>      if ( ret )
>          return ret;
>  
> -    if ( !(scaling_available_frequencies =
> -           xzalloc_array(uint32_t, op->u.get_para.freq_num)) )
> -        return -ENOMEM;
> -    for ( i = 0; i < op->u.get_para.freq_num; i++ )
> -        scaling_available_frequencies[i] =
> -                        pmpt->perf.states[i].core_frequency * 1000;
> -    ret = copy_to_guest(op->u.get_para.scaling_available_frequencies,
> -                   scaling_available_frequencies, op->u.get_para.freq_num);
> -    xfree(scaling_available_frequencies);
> -    if ( ret )
> -        return ret;
> +    if ( op->u.get_para.freq_num )
> +    {
> +        if ( !(scaling_available_frequencies =
> +               xzalloc_array(uint32_t, op->u.get_para.freq_num)) )
> +            return -ENOMEM;
> +        for ( i = 0; i < op->u.get_para.freq_num; i++ )
> +            scaling_available_frequencies[i] =
> +                            pmpt->perf.states[i].core_frequency * 1000;

Nit: Indentation was bogus here and ...

> +        ret = copy_to_guest(op->u.get_para.scaling_available_frequencies,
> +                    scaling_available_frequencies, op->u.get_para.freq_num);

... here before, and sadly continues to be bogus now.

> +        xfree(scaling_available_frequencies);
> +        if ( ret )
> +            return ret;
> +    }

While (beyond the nit above) I'm okay with this simple change, I think the
code here would benefit from folding the two allocations into one. There
simply is no reason to pay the price of the allocation overhead twice, when
we need a uint32_t[max(.cpu_num, .freq_num)] array anyway. That way the
churn introduced here would then also be smaller.
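
Sketched in isolation, with calloc() standing in for xzalloc_array() and a
made-up helper name, the folded allocation might be as simple as:

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative: one scratch buffer sized for the larger of the two uses. */
static uint32_t *alloc_scratch(unsigned int cpu_num, unsigned int freq_num)
{
    unsigned int num = cpu_num > freq_num ? cpu_num : freq_num;

    /* calloc() zero-fills the buffer; nothing is allocated for num == 0. */
    return num ? calloc(num, sizeof(uint32_t)) : NULL;
}
```

The same buffer would first be filled with the affected CPUs and copied out,
then (if freq_num is non-zero) refilled with the frequencies and copied out
again, before being freed once.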

> @@ -465,7 +468,8 @@ int do_pm_op(struct xen_sysctl_pm_op *op)
>      switch ( op->cmd & PM_PARA_CATEGORY_MASK )
>      {
>      case CPUFREQ_PARA:
> -        if ( !(xen_processor_pmbits & XEN_PROCESSOR_PM_PX) )
> +        if ( !(xen_processor_pmbits & (XEN_PROCESSOR_PM_PX |
> +                                       XEN_PROCESSOR_PM_CPPC)) )
>              return -ENODEV;
>          if ( !pmpt || !(pmpt->init & (XEN_PX_INIT | XEN_CPPC_INIT)) )
>              return -EINVAL;

(This is the hunk I'm referring to further up.)

Jan



* Re: [PATCH v3 09/15] xen/x86: introduce a new amd cppc driver for cpufreq scaling
  2025-03-25  9:57   ` Jan Beulich
@ 2025-03-25 13:58     ` Jason Andryuk
  2025-04-03  7:40     ` Penny, Zheng
  1 sibling, 0 replies; 60+ messages in thread
From: Jason Andryuk @ 2025-03-25 13:58 UTC (permalink / raw)
  To: Jan Beulich, Penny Zheng
  Cc: ray.huang, Andrew Cooper, Roger Pau Monné, xen-devel,
	Jason Andryuk

On 2025-03-25 05:57, Jan Beulich wrote:
> On 06.03.2025 09:39, Penny Zheng wrote:

>> +static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
>> +                                            unsigned int target_freq,
>> +                                            unsigned int relation)
>> +{
>> +    unsigned int cpu = policy->cpu;
>> +    const struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data, cpu);
>> +    uint8_t des_perf;
>> +    int res;
>> +
>> +    if ( unlikely(!target_freq) )
>> +        return 0;
> 
> Checking other *_cpufreq_target() functions, none would silently ignore
> a zero input. (HWP's ignores the input altogether though; Cc-ing Jason
> for possible clarification: I would have expected this driver here and
> the HWP one to be similar in this regard.)

Yes, for HWP, the target and relation are ignored.  All control is done 
by writing MSR_HWP_REQUEST which are "continuous, abstract, unit-less 
performance scale" values.  Those are applied by set_hwp_para() from 
`xenpm set-cpufreq-cppc`.

I think the difference is that this CPPC driver supports both autonomous 
(active) and passive mode.  The HWP driver I wrote only supports the 
equivalent of autonomous mode - write the MSR and let the processor figure 
it out.

I think Penny's implementation also uses the existing governors, whereas 
HWP only uses the dedicated hwp_governor.

Hopefully that gives some context.

Regards,
Jason



* Re: [PATCH v3 14/15] xen/xenpm: Adapt cpu frequency monitor in xenpm
  2025-03-25 11:26   ` Jan Beulich
@ 2025-03-25 16:37     ` Jason Andryuk
  2025-03-26 15:45     ` Anthony PERARD
  1 sibling, 0 replies; 60+ messages in thread
From: Jason Andryuk @ 2025-03-25 16:37 UTC (permalink / raw)
  To: Jan Beulich, Penny Zheng, Jason Andryuk, Anthony PERARD
  Cc: ray.huang, Juergen Gross, xen-devel

On 2025-03-25 07:26, Jan Beulich wrote:
> On 06.03.2025 09:39, Penny Zheng wrote:
>> Make `xenpm get-cpureq-para/set-cpufreq-para` available in CPPC mode.
>> --- a/tools/libs/ctrl/xc_pm.c
>> +++ b/tools/libs/ctrl/xc_pm.c
>> @@ -214,13 +214,12 @@ int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
>>   			 user_para->gov_num * CPUFREQ_NAME_LEN * sizeof(char), XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
>>   
>>       bool has_num = user_para->cpu_num &&
>> -                     user_para->freq_num &&
>>                        user_para->gov_num;
>>   
>>       if ( has_num )
> 
> Something looks wrong here already before your patch: With how has_num is set
> and with this conditional, ...
> 
>>       {
>>           if ( (!user_para->affected_cpus)                    ||
>> -             (!user_para->scaling_available_frequencies)    ||
>> +             (user_para->freq_num && !user_para->scaling_available_frequencies)    ||
>>                (user_para->gov_num && !user_para->scaling_available_governors) )
> 
> ... this ->gov_num check, ...
> 
>>           {
>>               errno = EINVAL;
>> @@ -228,14 +227,16 @@ int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
>>           }
>>           if ( xc_hypercall_bounce_pre(xch, affected_cpus) )
>>               goto unlock_1;
>> -        if ( xc_hypercall_bounce_pre(xch, scaling_available_frequencies) )
>> +        if ( user_para->freq_num &&
>> +             xc_hypercall_bounce_pre(xch, scaling_available_frequencies) )
>>               goto unlock_2;
>>           if ( user_para->gov_num &&
> 
> ... this one, and ...
> 
>>                xc_hypercall_bounce_pre(xch, scaling_available_governors) )
>>               goto unlock_3;
>>   
>>           set_xen_guest_handle(sys_para->affected_cpus, affected_cpus);
>> -        set_xen_guest_handle(sys_para->scaling_available_frequencies, scaling_available_frequencies);
>> +        if ( user_para->freq_num )
>> +            set_xen_guest_handle(sys_para->scaling_available_frequencies, scaling_available_frequencies);
> 
> (Nit: Yet another overly long line. It was too long already before, yes, but
>   that's no excuse to make it even longer.  The more that there is better
>   formatting right in context below.)
> 
>>           if ( user_para->gov_num )
> 
> ... this one are all dead code. Jason? I expect the has_num variable simply
> wants dropping altogether, thus correcting the earlier anomaly and getting
> the intended new behavior at the same time.

Hmmm.  The sysctl is executed twice - first to query the assorted *_num 
values and a second time to retrieve the results with sized arrays.

get_hwp_para() does not populate scaling_available_governors, so the 
intention was to be able to skip allocating the buffer for it.

     pmstat&xenpm: Re-arrage for cpufreq union

     Rearrange code now that xen_sysctl_pm_op's get_para fields has the
     nested union and struct.  In particular, the scaling governor
     information like scaling_available_governors is inside the union, so it
     is not always available.  Move those fields (op->u.get_para.u.s.u.*)
     together as well as the common fields (ones outside the union like
     op->u.get_para.turbo_enabled).

     With that, gov_num may be 0, so bounce buffer handling needs
     to be modified.

     scaling_governor and other fields inside op->u.get_para.u.s.u.* won't
     be used for hwp, so this will simplify the change when hwp support is
     introduced and re-indents these lines all together.

I noted that gov_num may be 0.  But that may have been before hwp had 
its own internal governor.  But, yes, the has_num handling looks wrong 
for gov_num == 0.  I don't have a machine with hwp to verify.
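
As a rough illustration of what per-array checks could look like if 
has_num were dropped, here is a minimal sketch.  The struct and function 
names are hypothetical stand-ins, not the libxc definitions, and this is 
only one possible reading of the suggestion above:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the caller-side parameter block. */
struct para_sketch {
    unsigned int cpu_num, freq_num, gov_num;
    uint32_t *affected_cpus;
    uint32_t *frequencies;
    char *governors;
};

/*
 * Check the caller's buffers before the second (retrieval) pass.
 * Each array is required only when its own count is non-zero, so the
 * first (query) pass with all counts at zero passes trivially.
 */
static int check_buffers(const struct para_sketch *p)
{
    assert(p != NULL);

    if ( (p->cpu_num && !p->affected_cpus) ||
         (p->freq_num && !p->frequencies) ||
         (p->gov_num && !p->governors) )
        return -EINVAL;

    return 0;
}
```

With checks of this shape, a driver reporting freq_num == 0 or 
gov_num == 0 would not need the corresponding buffer at all.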


>> @@ -926,7 +926,8 @@ static int show_cpufreq_para_by_cpuid(xc_interface *xc_handle, int cpuid)
>>               ret = -ENOMEM;
>>               goto out;
>>           }
>> -        if (!(p_cpufreq->scaling_available_frequencies =
>> +        if (p_cpufreq->freq_num &&
>> +            !(p_cpufreq->scaling_available_frequencies =
>>                 malloc(p_cpufreq->freq_num * sizeof(uint32_t))))
>>           {
>>               fprintf(stderr,
> 
> Can someone explain to me how the pre-existing logic here works? All
> three ->*_num start out as zero. Hence respective allocations (of zero
> size) may conceivably return NULL (the behavior there is implementation
> defined after all). Yet then we'd bail from the loop, and hence from the
> function. IOW adding a ->freq_num check and also a ->cpu_num one (along
> with the ->gov_num one that apparently was added during HWP development)
> would once again look like an independent (latent) bugfix to me.

I guess we rely on glibc providing non-NULL?  But also they are ignored 
for the initial query of *_num values.
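
To avoid depending on what malloc(0) returns, the allocations could go 
through a small helper that skips zero-sized requests.  The following is 
only a sketch; alloc_array() is a hypothetical name and not something 
xenpm has today:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Allocate an array of n elements of elem_size bytes each.
 * For n == 0 no allocation is attempted and *out is set to NULL, so a
 * NULL pointer never has to be told apart from an allocation failure.
 * Returns 0 on success, -1 on allocation failure.
 */
static int alloc_array(void **out, size_t n, size_t elem_size)
{
    assert(out != NULL);

    *out = NULL;
    if ( n == 0 )
        return 0;

    *out = calloc(n, elem_size);

    return *out ? 0 : -1;
}
```

The caller would then only report -ENOMEM when the helper fails, which 
cannot happen for a zero count.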

Regards,
Jason



* Re: [PATCH v3 15/15] xen/cpufreq: Adapt SET/GET_CPUFREQ_CPPC xen_sysctl_pm_op for amd-cppc driver
  2025-03-06  8:39 ` [PATCH v3 15/15] xen/cpufreq: Adapt SET/GET_CPUFREQ_CPPC xen_sysctl_pm_op for amd-cppc driver Penny Zheng
@ 2025-03-25 16:59   ` Jan Beulich
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Beulich @ 2025-03-25 16:59 UTC (permalink / raw)
  To: Penny Zheng; +Cc: ray.huang, Andrew Cooper, Roger Pau Monné, xen-devel

On 06.03.2025 09:39, Penny Zheng wrote:
> Introduce helper set_amd_cppc_para and get_amd_cppc_para to
> SET/GET CPPC-related para for amd-cppc/amd-cppc-epp driver.
> 
> Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
> ---
> v1 -> v2:
> - Give the variable des_perf an initializer of 0
> - Use the strncmp()s directly in the if()
> ---
>  xen/arch/x86/acpi/cpufreq/amd-cppc.c | 124 +++++++++++++++++++++++++++
>  xen/drivers/acpi/pmstat.c            |  20 ++++-
>  xen/include/acpi/cpufreq/cpufreq.h   |   5 ++
>  3 files changed, 145 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/arch/x86/acpi/cpufreq/amd-cppc.c b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
> index 606bb648b3..28c13b09c8 100644
> --- a/xen/arch/x86/acpi/cpufreq/amd-cppc.c
> +++ b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
> @@ -32,6 +32,7 @@
>  
>  static bool __ro_after_init opt_active_mode;
>  static DEFINE_PER_CPU_READ_MOSTLY(uint8_t, epp_init);
> +static bool __ro_after_init amd_cppc_in_use;
>  
>  struct amd_cppc_drv_data
>  {
> @@ -513,6 +514,123 @@ static int cf_check amd_cppc_epp_set_policy(struct cpufreq_policy *policy)
>      return amd_cppc_epp_update_limit(policy);
>  }
>  
> +int get_amd_cppc_para(unsigned int cpu,
> +                      struct xen_cppc_para *cppc_para)
> +{
> +    const struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data, cpu);
> +
> +    if ( data == NULL )
> +        return -ENODATA;
> +
> +    cppc_para->features         = 0;
> +    cppc_para->lowest           = data->caps.lowest_perf;
> +    cppc_para->lowest_nonlinear = data->caps.lowest_nonlinear_perf;
> +    cppc_para->nominal          = data->caps.nominal_perf;
> +    cppc_para->highest          = data->caps.highest_perf;
> +    cppc_para->minimum          = data->req.min_perf;
> +    cppc_para->maximum          = data->req.max_perf;
> +    cppc_para->desired          = data->req.des_perf;
> +    cppc_para->energy_perf      = data->req.epp;
> +
> +    return 0;
> +}
> +
> +int set_amd_cppc_para(const struct cpufreq_policy *policy,
> +                      const struct xen_set_cppc_para *set_cppc)
> +{
> +    unsigned int cpu = policy->cpu;
> +    struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data, cpu);
> +    uint8_t max_perf, min_perf, des_perf = 0, epp;
> +
> +    if ( data == NULL )
> +        return -ENOENT;
> +
> +    /* Validate all parameters - Disallow reserved bits. */
> +    if ( set_cppc->minimum > UINT8_MAX || set_cppc->maximum > UINT8_MAX ||
> +         set_cppc->desired > UINT8_MAX || set_cppc->energy_perf > UINT8_MAX )
> +        return -EINVAL;
> +
> +    /* Only allow values if params bit is set. */
> +    if ( (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED) &&
> +          set_cppc->desired) ||
> +         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MINIMUM) &&
> +          set_cppc->minimum) ||
> +         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MAXIMUM) &&
> +          set_cppc->maximum) ||
> +         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ENERGY_PERF) &&
> +          set_cppc->energy_perf) )
> +        return -EINVAL;
> +
> +    /* Activity window not supported in MSR */
> +    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ACT_WINDOW )
> +        return -EOPNOTSUPP;
> +
> +    /* Return if there is nothing to do. */
> +    if ( set_cppc->set_params == 0 )
> +        return 0;
> +
> +    epp = per_cpu(epp_init, cpu);
> +    /* Apply presets */
> +    /*
> +     * XEN_SYSCTL_CPPC_SET_PRESET_POWERSAVE/PERFORMANCE/BALANCE are
> +     * for amd-cppc in active mode, min_perf could be set with lowest_perf
> +     * representing the T-state range of performance levels, while
> +     * XEN_SYSCTL_CPPC_SET_PRESET_NONE is for amd-cppc in passive mode, it
> +     * depends on governor to do performance scaling, setting with
> +     * lowest_nonlinear_perf to ensures performance in P-state range.
> +     */

Nit: There are again two consecutive comments here.

The active / passive mode distinction mentioned in the comment isn't
reflected anywhere in the code. It's the XEN_SYSCTL_CPPC_SET_DESIRED
which distinguishes them, yet that flag isn't mentioned in the comment.

> +    switch ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_PRESET_MASK )
> +    {
> +    case XEN_SYSCTL_CPPC_SET_PRESET_POWERSAVE:
> +        if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED )
> +            return -EINVAL;
> +        min_perf = data->caps.lowest_perf;
> +        max_perf = data->caps.highest_perf;

These are still not both ".lowest_perf", and I still don't understand
- due to the lack of a comment - why that is.

> +        epp = CPPC_ENERGY_PERF_MAX_POWERSAVE;
> +        break;
> +
> +    case XEN_SYSCTL_CPPC_SET_PRESET_PERFORMANCE:
> +        if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED )
> +            return -EINVAL;
> +        min_perf = data->caps.highest_perf;
> +        max_perf = data->caps.highest_perf;
> +        epp = CPPC_ENERGY_PERF_MAX_PERFORMANCE;
> +        break;
> +
> +    case XEN_SYSCTL_CPPC_SET_PRESET_BALANCE:
> +        if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED )
> +            return -EINVAL;
> +        min_perf = data->caps.lowest_perf;
> +        max_perf = data->caps.highest_perf;
> +        epp = CPPC_ENERGY_PERF_BALANCE;
> +        break;
> +
> +    case XEN_SYSCTL_CPPC_SET_PRESET_NONE:
> +        min_perf = data->caps.lowest_nonlinear_perf;
> +        max_perf = data->caps.highest_perf;
> +        break;
> +
> +    default:
> +        return -EINVAL;
> +    }
> +
> +    /* Further customize presets if needed */
> +    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MINIMUM )
> +        min_perf = set_cppc->minimum;
> +
> +    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MAXIMUM )
> +        max_perf = set_cppc->maximum;
> +
> +    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ENERGY_PERF )
> +        epp = set_cppc->energy_perf;
> +
> +    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED )
> +        des_perf = set_cppc->desired;

Considering these I even less understand what the comment further up is
about.

> +    return amd_cppc_write_request(cpu, min_perf, des_perf, max_perf, epp);
> +}
> +
> +

Nit (not for the first time, I think): No double blank lines please.

> @@ -533,6 +651,11 @@ amd_cppc_epp_driver =
>      .exit       = amd_cppc_cpufreq_cpu_exit,
>  };
>  
> +bool amd_cppc_active(void)
> +{
> +    return amd_cppc_in_use;
> +}
> +
>  int __init amd_cppc_register_driver(void)
>  {
>      int ret;
> @@ -552,6 +675,7 @@ int __init amd_cppc_register_driver(void)
>  
>      /* Remove possible fallback option */
>      xen_processor_pmbits &= ~XEN_PROCESSOR_PM_PX;
> +    amd_cppc_in_use = true;

Is this separate flag really needed? Can't you go from xen_processor_pmbits?

> --- a/xen/drivers/acpi/pmstat.c
> +++ b/xen/drivers/acpi/pmstat.c
> @@ -261,7 +261,16 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
>           !strncmp(op->u.get_para.scaling_driver, XEN_HWP_DRIVER_NAME,
>                    CPUFREQ_NAME_LEN) )
>          ret = get_hwp_para(policy->cpu, &op->u.get_para.u.cppc_para);
> -    else
> +    else if ( !strncmp(op->u.get_para.scaling_driver, XEN_AMD_CPPC_DRIVER_NAME,
> +                       CPUFREQ_NAME_LEN) ||
> +              !strncmp(op->u.get_para.scaling_driver, XEN_AMD_CPPC_EPP_DRIVER_NAME,

Overlong line again.

Jan



* RE: [PATCH v3 03/15] xen/cpufreq: refactor cmdline "cpufreq=xxx"
  2025-03-24 15:00   ` Jan Beulich
@ 2025-03-26  7:20     ` Penny, Zheng
  2025-03-26 10:43       ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Penny, Zheng @ 2025-03-26  7:20 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Huang, Ray, Andrew Cooper, Anthony PERARD, Orzel, Michal,
	Julien Grall, Roger Pau Monné, Stefano Stabellini,
	xen-devel@lists.xenproject.org

[Public]

Hi,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Monday, March 24, 2025 11:01 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Anthony PERARD <anthony.perard@vates.tech>;
> Orzel, Michal <Michal.Orzel@amd.com>; Julien Grall <julien@xen.org>; Roger Pau
> Monné <roger.pau@citrix.com>; Stefano Stabellini <sstabellini@kernel.org>; xen-
> devel@lists.xenproject.org
> Subject: Re: [PATCH v3 03/15] xen/cpufreq: refactor cmdline "cpufreq=xxx"
>
> On 06.03.2025 09:39, Penny Zheng wrote:
> > --- a/docs/misc/xen-command-line.pandoc
> > +++ b/docs/misc/xen-command-line.pandoc
> > @@ -535,7 +535,8 @@ choice of `dom0-kernel` is deprecated and not supported
> by all Dom0 kernels.
> >    processor to autonomously force physical package components into idle state.
> >    The default is enabled, but the option only applies when `hwp` is enabled.
> >
> > -There is also support for `;`-separated fallback options:
> > +User could use `;`-separated options to support universal options
> > +which they would like to try on any agnostic platform, *but* under
> > +priority order, like
> >  `cpufreq=hwp;xen,verbose`.  This first tries `hwp` and falls back to
> > `xen` if  unavailable.  Note: The `verbose` suboption is handled
> > globally.  Setting it  for either the primary or fallback option
> > applies to both irrespective of where
>
> What does "support" here mean? I fear I can't even suggest what else to use, as I
> don't follow what additional information you mean to add here. Is a change here
> really needed?
>

There are two changes I'd like to address:
1) ";" is no longer meant only for fallback options. As we discussed before, we would
like to support something like "cpufreq=hwp;amd-cppc;xen", so that users can list all the
options they would like to try on any platform.
2) The options are tried in *priority* order. In cpufreq_driver_init(), a loop decides which driver to
try first. If the user specifies "cpufreq=xen;amd-cppc", legacy P-state is placed before amd-cppc in cpufreq_xen_opts[],
so the loop tries to register legacy P-state first, and once that succeeds, we do not try to register amd-cppc at all.
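
The priority-order behaviour can be sketched roughly as follows. The enum, the callback
type and both callbacks are hypothetical stand-ins for illustration, not the actual Xen code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical option identifiers, in the spirit of cpufreq_xen_opts[]. */
enum opt_sketch { OPT_xen, OPT_hwp, OPT_amd_cppc };

/* Hypothetical registration callback: returns 0 on success. */
typedef int (*register_fn)(enum opt_sketch opt);

/*
 * Try each requested option in command-line order and stop at the first
 * one that registers. Returns the index of that option, or -1 if none did.
 */
static int try_in_order(const enum opt_sketch *opts, size_t cnt,
                        register_fn reg)
{
    assert(reg != NULL);

    for ( size_t i = 0; i < cnt; i++ )
        if ( reg(opts[i]) == 0 )
            return (int)i;

    return -1;
}

/* Stand-in for a platform where every driver registers successfully. */
static int platform_supports_all(enum opt_sketch opt)
{
    (void)opt;
    return 0;
}

/* Stand-in for a platform where only the legacy driver registers. */
static int platform_legacy_only(enum opt_sketch opt)
{
    return opt == OPT_xen ? 0 : -1;
}
```

With { OPT_xen, OPT_amd_cppc } on a platform supporting both, the loop stops at index 0 and
amd-cppc is never attempted, which is why the order given on the command line matters.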

> > --- a/xen/drivers/cpufreq/cpufreq.c
> > +++ b/xen/drivers/cpufreq/cpufreq.c
> > +    if ( arg[0] && arg[1] )
> > +        ret = cpufreq_cmdline_parse(arg + 1, end);
> > +
> > +    return ret;
> > +}
> > +
> > +static int __init cpufreq_cmdline_parse_hwp(const char *arg, const
> > +char *end) {
> > +    int ret = 0;
> > +
> > +    xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
> > +    cpufreq_controller = FREQCTL_xen;
> > +    cpufreq_xen_opts[cpufreq_xen_cnt++] = CPUFREQ_hwp;
> > +    if ( arg[0] && arg[1] )
> > +        ret = hwp_cmdline_parse(arg + 1, end);
> > +
> > +    return ret;
> > +}
>
> For both of the helpers may I suggest s/parse/process/ or some such ("handle"
> might be another possible term to use), as themselves they don't do any parsing?
>

Maybe I misunderstood your previous comment:
```
        >          else if ( IS_ENABLED(CONFIG_INTEL) && choice < 0 &&
        > ```

        For the rest of this, I guess I'd prefer to see this in context. Also with
        regard to the helper function's name.
```
I thought you suggested introducing a helper function to wrap the conditional code...
Or maybe you were suggesting something like:
```
#ifdef CONFIG_INTEL
else if ( choice < 0 && !cmdline_strcmp(str, "hwp") )
{
    xen_processor_pmbits |= XEN_PROCES
    ...
}
#endif
```

> In the end I'm also not entirely convinced that we need these two almost identical
> helpers (with a 3rd likely appearing in a later patch).
>

> Jan


* RE: [PATCH v3 04/15] xen/cpufreq: move XEN_PROCESSOR_PM_xxx to internal header
  2025-03-24 15:11   ` Jan Beulich
@ 2025-03-26  7:48     ` Penny, Zheng
  0 siblings, 0 replies; 60+ messages in thread
From: Penny, Zheng @ 2025-03-26  7:48 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Huang, Ray, Andrew Cooper, Roger Pau Monné, Anthony PERARD,
	Orzel, Michal, Julien Grall, Stefano Stabellini,
	xen-devel@lists.xenproject.org

[Public]

Hi

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Monday, March 24, 2025 11:12 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Roger Pau Monné <roger.pau@citrix.com>;
> Anthony PERARD <anthony.perard@vates.tech>; Orzel, Michal
> <Michal.Orzel@amd.com>; Julien Grall <julien@xen.org>; Stefano Stabellini
> <sstabellini@kernel.org>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v3 04/15] xen/cpufreq: move XEN_PROCESSOR_PM_xxx
> to internal header
>
> On 06.03.2025 09:39, Penny Zheng wrote:
> > XEN_PROCESSOR_PM_xxx are used to set xen_processor_pmbits only, which
> > is a Xen-internal variable only. Although PV Dom0 passed these bits in
> > si->flags, they haven't been used anywhere.
>
> Please be careful with "not used anywhere". See e.g.
> https://xenbits.xen.org/gitweb/?p=legacy/linux-2.6.18-
> xen.git;a=blob;f=arch/i386/kernel/acpi/processor_extcntl_xen.c;h=eb6a53e9572c13
> 7da505a7d4970b1a5b7e1c522d;hb=HEAD#l193
>
> > So this commit moves XEN_PROCESSOR_PM_xxx back to internal header
> > "acpi/cpufreq/processor_perf.h"
>
> Essentially you're again altering the stable public ABI in a way that's not acceptable.
>

Understood...
I misunderstood the previous comment again...
I'll only move  the new XEN_PROCESSOR_PM_CPPC into the internal header

> Jan


* RE: [PATCH v3 05/15] xen/x86: introduce "cpufreq=amd-cppc" xen cmdline
  2025-03-24 15:26   ` Jan Beulich
@ 2025-03-26  8:35     ` Penny, Zheng
  2025-03-26 10:55       ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Penny, Zheng @ 2025-03-26  8:35 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Huang, Ray, Andrew Cooper, Anthony PERARD, Orzel, Michal,
	Julien Grall, Roger Pau Monné, Stefano Stabellini,
	xen-devel@lists.xenproject.org

[Public]

Hi,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Monday, March 24, 2025 11:26 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Anthony PERARD <anthony.perard@vates.tech>;
> Orzel, Michal <Michal.Orzel@amd.com>; Julien Grall <julien@xen.org>; Roger Pau
> Monné <roger.pau@citrix.com>; Stefano Stabellini <sstabellini@kernel.org>; xen-
> devel@lists.xenproject.org
> Subject: Re: [PATCH v3 05/15] xen/x86: introduce "cpufreq=amd-cppc" xen
> cmdline
>
> On 06.03.2025 09:39, Penny Zheng wrote:
> > @@ -514,5 +515,14 @@ acpi_cpufreq_driver = {
> >
> >  int __init acpi_cpufreq_register(void)  {
> > -    return cpufreq_register_driver(&acpi_cpufreq_driver);
> > +    int ret;
> > +
> > +    ret = cpufreq_register_driver(&acpi_cpufreq_driver);
> > +    if ( ret )
> > +        return ret;
> > +
> > +    if ( IS_ENABLED(CONFIG_AMD) )
> > +        xen_processor_pmbits &= ~XEN_PROCESSOR_PM_CPPC;
>
> What's the purpose of the if() here?

After the cpufreq driver has been registered, I'd like XEN_PROCESSOR_PM_PX and XEN_PROCESSOR_PM_CPPC
to be mutually exclusive, so that they represent the driver that actually got registered.
Users could specify something like "cpufreq=amd-cppc,xen", which means both XEN_PROCESSOR_PM_PX and XEN_PROCESSOR_PM_CPPC
get set by the parsing logic. If amd-cppc fails to register, we fall back to the legacy driver, and XEN_PROCESSOR_PM_CPPC then needs to be cleared.
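
A minimal sketch of that exclusivity, assuming made-up bit values (the real XEN_PROCESSOR_PM_*
constants may differ) and a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit values, for illustration only. */
#define PM_PX_SKETCH    (1u << 0)
#define PM_CPPC_SKETCH  (1u << 1)

enum driver_sketch { DRV_legacy, DRV_cppc };

/*
 * Once a driver has registered, drop the bit belonging to the other one
 * so that the remaining bit names the driver actually in use. All
 * unrelated bits are left untouched.
 */
static uint32_t settle_pmbits(uint32_t pmbits, enum driver_sketch registered)
{
    assert(registered == DRV_legacy || registered == DRV_cppc);

    return pmbits & ~(registered == DRV_cppc ? PM_PX_SKETCH
                                             : PM_CPPC_SKETCH);
}
```

This mirrors the two places in the series where a bit is cleared: amd_cppc_register_driver()
drops the PX bit, and acpi_cpufreq_register() drops the CPPC bit.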

>
> > @@ -157,7 +161,35 @@ static int __init cf_check
> > cpufreq_driver_init(void)
> >
> >          case X86_VENDOR_AMD:
> >          case X86_VENDOR_HYGON:
> > -            ret = IS_ENABLED(CONFIG_AMD) ? powernow_register_driver() : -
> ENODEV;
> > +            if ( !IS_ENABLED(CONFIG_AMD) )
> > +            {
> > +                ret = -ENODEV;
> > +                break;
> > +            }
> > +            ret = -ENOENT;
> > +
> > +            for ( unsigned int i = 0; i < cpufreq_xen_cnt; i++ )
> > +            {
> > +                switch ( cpufreq_xen_opts[i] )
> > +                {
> > +                case CPUFREQ_xen:
> > +                    ret = powernow_register_driver();
> > +                    break;
> > +                case CPUFREQ_amd_cppc:
> > +                    ret = amd_cppc_register_driver();
> > +                    break;
> > +                case CPUFREQ_none:
> > +                    ret = 0;
> > +                    break;
> > +                default:
> > +                    printk(XENLOG_WARNING
> > +                           "Unsupported cpufreq driver for vendor
> > + AMD\n");
>
> What about Hygon?
>
> > --- a/xen/include/acpi/cpufreq/cpufreq.h
> > +++ b/xen/include/acpi/cpufreq/cpufreq.h
> > @@ -28,6 +28,7 @@ enum cpufreq_xen_opt {
> >      CPUFREQ_none,
> >      CPUFREQ_xen,
> >      CPUFREQ_hwp,
> > +    CPUFREQ_amd_cppc,
> >  };
> >  extern enum cpufreq_xen_opt cpufreq_xen_opts[2];
>
> I'm pretty sure I pointed out before that this array needs to grow, now that you add a
> 3rd kind of handling.
>

Hmmm, but CPUFREQ_hwp and CPUFREQ_amd_cppc are incompatible options.
I thought cpufreq_xen_opts[] should reflect the choices available on the hardware.
Even if users specify "cpufreq=hwp;amd-cppc;xen", on an Intel platform cpufreq_xen_opts[] would
contain CPUFREQ_hwp and CPUFREQ_xen, while on an AMD platform cpufreq_xen_opts[] would
contain CPUFREQ_amd_cppc and CPUFREQ_xen.

> Jan


* RE: [PATCH v3 07/15] xen/cpufreq: fix core frequency calculation for AMD Family 1Ah CPUs
  2025-03-24 15:47   ` Jan Beulich
  2025-03-25 10:55     ` Nicola Vetrini
@ 2025-03-26  9:54     ` Penny, Zheng
  2025-03-26 10:14       ` Nicola Vetrini
  1 sibling, 1 reply; 60+ messages in thread
From: Penny, Zheng @ 2025-03-26  9:54 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Huang, Ray, Andrew Cooper, Roger Pau Monné,
	xen-devel@lists.xenproject.org, Nicola Vetrini

[Public]

Hi,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Monday, March 24, 2025 11:48 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Roger Pau Monné <roger.pau@citrix.com>; xen-
> devel@lists.xenproject.org; Nicola Vetrini <nicola.vetrini@bugseng.com>
> Subject: Re: [PATCH v3 07/15] xen/cpufreq: fix core frequency calculation for AMD
> Family 1Ah CPUs
>
> On 06.03.2025 09:39, Penny Zheng wrote:
> > This commit fixes core frequency calculation for AMD Family 1Ah CPUs,
> > due to a change in the PStateDef MSR layout in AMD Family 1Ah+.
> > In AMD Family 1Ah+, Core current operating frequency in MHz is
> > calculated as
> > follows:
>
> Why 1Ah+? In the code you correctly limit to just 1Ah.
>
> > --- a/xen/arch/x86/cpu/amd.c
> > +++ b/xen/arch/x86/cpu/amd.c
> > @@ -572,12 +572,24 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
> >                                                            :
> > c->cpu_core_id);  }
> >
> > +static uint64_t amd_parse_freq(const struct cpuinfo_x86 *c, uint64_t
> > +value) {
> > +   ASSERT(c->x86 <= 0x1A);
> > +
> > +   if (c->x86 < 0x17)
> > +           return (((value & 0x3f) + 0x10) * 100) >> ((value >> 6) & 7);
> > +   else if (c->x86 <= 0x19)
> > +           return ((value & 0xff) * 25 * 8) / ((value >> 8) & 0x3f);
> > +   else
> > +           return (value & 0xfff) * 5;
> > +}
>
> Could I talk you into omitting the unnecessary "else" in cases like this one?
> (This may also make sense to express as switch().)
>

Sorry, bad habit... will change it to switch

> > @@ -658,19 +670,20 @@ void amd_log_freq(const struct cpuinfo_x86 *c)
> >     if (!(lo >> 63))
> >             return;
> >
> > -#define FREQ(v) (c->x86 < 0x17 ? ((((v) & 0x3f) + 0x10) * 100) >> (((v) >> 6) &
> 7) \
> > -                                : (((v) & 0xff) * 25 * 8) / (((v) >> 8) & 0x3f))
> >     if (idx && idx < h &&
> >         !rdmsr_safe(0xC0010064 + idx, val) && (val >> 63) &&
> >         !rdmsr_safe(0xC0010064, hi) && (hi >> 63))
> >             printk("CPU%u: %lu (%lu ... %lu) MHz\n",
> > -                  smp_processor_id(), FREQ(val), FREQ(lo), FREQ(hi));
> > +                  smp_processor_id(),
> > +                  amd_parse_freq(c, val),
> > +                  amd_parse_freq(c, lo), amd_parse_freq(c, hi));
>
> I fear Misra won't like multiple function calls to evaluate the parameters to pass to
> another function. Iirc smp_processor_id() has special exception, so that's okay here.
> This may be possible to alleviate by marking the new helper pure or even const
> (see gcc doc as to caveats with passing pointers to const functions). Cc-ing Nicola
> for possible clarification or correction.
>

Maybe we shall declare the function __pure. Having checked the gcc doc,
``
a function that has pointer arguments must not be declared const
``
Otherwise we store the "c->x86" value to avoid using the pointer
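
For illustration, the switch() form with the family passed by value might look like the
sketch below. The function name is hypothetical, the zero-divisor guard is an addition of
this sketch (the quoted patch divides unconditionally), and whether a pure/const attribute
is then appropriate is left open; passing the family by value merely removes the pointer
argument that the gcc caveat is about.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decode a PStateDef MSR value into MHz, using the three formulas from
 * the quoted patch. 'family' is what the patch reads from c->x86.
 */
static uint64_t freq_mhz_sketch(unsigned int family, uint64_t value)
{
    uint64_t did;

    assert(family <= 0x1A);

    switch ( family )
    {
    case 0x1A:
        return (value & 0xfff) * 5;

    case 0x17:
    case 0x18:
    case 0x19:
        did = (value >> 8) & 0x3f;
        /* Sketch-only guard against a zero divisor. */
        return did ? ((value & 0xff) * 25 * 8) / did : 0;

    default: /* families below 0x17 */
        return (((value & 0x3f) + 0x10) * 100) >> ((value >> 6) & 7);
    }
}
```

For example, a Family 1Ah value with 0x2D0 in the low 12 bits decodes to 720 * 5 = 3600 MHz.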

> Jan


* Re: [PATCH v3 07/15] xen/cpufreq: fix core frequency calculation for AMD Family 1Ah CPUs
  2025-03-26  9:54     ` Penny, Zheng
@ 2025-03-26 10:14       ` Nicola Vetrini
  2025-03-26 10:19         ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Nicola Vetrini @ 2025-03-26 10:14 UTC (permalink / raw)
  To: Penny, Zheng
  Cc: Jan Beulich, Huang, Ray, Andrew Cooper, Roger Pau Monné,
	xen-devel

On 2025-03-26 10:54, Penny, Zheng wrote:
> [Public]
> 
> Hi,
> 
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Monday, March 24, 2025 11:48 PM
>> To: Penny, Zheng <penny.zheng@amd.com>
>> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
>> <andrew.cooper3@citrix.com>; Roger Pau Monné <roger.pau@citrix.com>; 
>> xen-
>> devel@lists.xenproject.org; Nicola Vetrini 
>> <nicola.vetrini@bugseng.com>
>> Subject: Re: [PATCH v3 07/15] xen/cpufreq: fix core frequency 
>> calculation for AMD
>> Family 1Ah CPUs
>> 
>> On 06.03.2025 09:39, Penny Zheng wrote:
>> > This commit fixes core frequency calculation for AMD Family 1Ah CPUs,
>> > due to a change in the PStateDef MSR layout in AMD Family 1Ah+.
>> > In AMD Family 1Ah+, Core current operating frequency in MHz is
>> > calculated as
>> > follows:
>> 
>> Why 1Ah+? In the code you correctly limit to just 1Ah.
>> 
>> > --- a/xen/arch/x86/cpu/amd.c
>> > +++ b/xen/arch/x86/cpu/amd.c
>> > @@ -572,12 +572,24 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
>> >                                                            :
>> > c->cpu_core_id);  }
>> >
>> > +static uint64_t amd_parse_freq(const struct cpuinfo_x86 *c, uint64_t
>> > +value) {
>> > +   ASSERT(c->x86 <= 0x1A);
>> > +
>> > +   if (c->x86 < 0x17)
>> > +           return (((value & 0x3f) + 0x10) * 100) >> ((value >> 6) & 7);
>> > +   else if (c->x86 <= 0x19)
>> > +           return ((value & 0xff) * 25 * 8) / ((value >> 8) & 0x3f);
>> > +   else
>> > +           return (value & 0xfff) * 5;
>> > +}
>> 
>> Could I talk you into omitting the unnecessary "else" in cases like 
>> this one?
>> (This may also make sense to express as switch().)
>> 
> 
> Sorry, bad habit... will change it to switch
> 
>> > @@ -658,19 +670,20 @@ void amd_log_freq(const struct cpuinfo_x86 *c)
>> >     if (!(lo >> 63))
>> >             return;
>> >
>> > -#define FREQ(v) (c->x86 < 0x17 ? ((((v) & 0x3f) + 0x10) * 100) >> (((v) >> 6) &
>> 7) \
>> > -                                : (((v) & 0xff) * 25 * 8) / (((v) >> 8) & 0x3f))
>> >     if (idx && idx < h &&
>> >         !rdmsr_safe(0xC0010064 + idx, val) && (val >> 63) &&
>> >         !rdmsr_safe(0xC0010064, hi) && (hi >> 63))
>> >             printk("CPU%u: %lu (%lu ... %lu) MHz\n",
>> > -                  smp_processor_id(), FREQ(val), FREQ(lo), FREQ(hi));
>> > +                  smp_processor_id(),
>> > +                  amd_parse_freq(c, val),
>> > +                  amd_parse_freq(c, lo), amd_parse_freq(c, hi));
>> 
>> I fear Misra won't like multiple function calls to evaluate the 
>> parameters to pass to
>> another function. Iirc smp_processor_id() has special exception, so 
>> that's okay here.
>> This may be possible to alleviate by marking the new helper pure or 
>> even const
>> (see gcc doc as to caveats with passing pointers to const functions). 
>> Cc-ing Nicola
>> for possible clarification or correction.
>> 
> 
> Maybe we shall declare the function __pure. Having checked the gcc doc,
> ``
> a function that has pointer arguments must not be declared const
> ``
> Otherwise we store the "c->x86" value to avoid using the pointer
> 

Either way could work. ECLAIR will automatically pick up 
__attribute__((pure)) or __attribute__((const)) from the declaration. 
Maybe it could be const, as from a cursory look I don't think the gcc 
restriction on pointer arguments applies, as the pointee is not modified 
between successive calls, but I might be mistaken.

>> Jan

-- 
Nicola Vetrini, B.Sc.
Software Engineer
BUGSENG (https://bugseng.com)
LinkedIn: https://www.linkedin.com/in/nicola-vetrini-a42471253



* Re: [PATCH v3 07/15] xen/cpufreq: fix core frequency calculation for AMD Family 1Ah CPUs
  2025-03-26 10:14       ` Nicola Vetrini
@ 2025-03-26 10:19         ` Jan Beulich
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Beulich @ 2025-03-26 10:19 UTC (permalink / raw)
  To: Nicola Vetrini, Penny, Zheng
  Cc: Huang, Ray, Andrew Cooper, Roger Pau Monné, xen-devel

On 26.03.2025 11:14, Nicola Vetrini wrote:
> On 2025-03-26 10:54, Penny, Zheng wrote:
>>> -----Original Message-----
>>> From: Jan Beulich <jbeulich@suse.com>
>>> Sent: Monday, March 24, 2025 11:48 PM
>>>
>>> On 06.03.2025 09:39, Penny Zheng wrote:
>>>> This commit fixes core frequency calculation for AMD Family 1Ah CPUs,
>>>> due to a change in the PStateDef MSR layout in AMD Family 1Ah+.
>>>> In AMD Family 1Ah+, Core current operating frequency in MHz is
>>>> calculated as
>>>> follows:
>>>
>>> Why 1Ah+? In the code you correctly limit to just 1Ah.
>>>
>>>> --- a/xen/arch/x86/cpu/amd.c
>>>> +++ b/xen/arch/x86/cpu/amd.c
>>>> @@ -572,12 +572,24 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
>>>>                                                            :
>>>> c->cpu_core_id);  }
>>>>
>>>> +static uint64_t amd_parse_freq(const struct cpuinfo_x86 *c, uint64_t
>>>> +value) {
>>>> +   ASSERT(c->x86 <= 0x1A);
>>>> +
>>>> +   if (c->x86 < 0x17)
>>>> +           return (((value & 0x3f) + 0x10) * 100) >> ((value >> 6) & 7);
>>>> +   else if (c->x86 <= 0x19)
>>>> +           return ((value & 0xff) * 25 * 8) / ((value >> 8) & 0x3f);
>>>> +   else
>>>> +           return (value & 0xfff) * 5;
>>>> +}
>>>
>>> Could I talk you into omitting the unnecessary "else" in cases like 
>>> this one?
>>> (This may also make sense to express as switch().)
>>>
>>
>> Sorry, bad habit... will change it to switch
>>
>>>> @@ -658,19 +670,20 @@ void amd_log_freq(const struct cpuinfo_x86 *c)
>>>>     if (!(lo >> 63))
>>>>             return;
>>>>
>>>> -#define FREQ(v) (c->x86 < 0x17 ? ((((v) & 0x3f) + 0x10) * 100) >> (((v) >> 6) & 7) \
>>>> -                                : (((v) & 0xff) * 25 * 8) / (((v) >> 8) & 0x3f))
>>>>     if (idx && idx < h &&
>>>>         !rdmsr_safe(0xC0010064 + idx, val) && (val >> 63) &&
>>>>         !rdmsr_safe(0xC0010064, hi) && (hi >> 63))
>>>>             printk("CPU%u: %lu (%lu ... %lu) MHz\n",
>>>> -                  smp_processor_id(), FREQ(val), FREQ(lo), FREQ(hi));
>>>> +                  smp_processor_id(),
>>>> +                  amd_parse_freq(c, val),
>>>> +                  amd_parse_freq(c, lo), amd_parse_freq(c, hi));
>>>
>>> I fear Misra won't like multiple function calls to evaluate the
>>> parameters to pass to another function. Iirc smp_processor_id() has a
>>> special exception, so that's okay here. This may be possible to
>>> alleviate by marking the new helper pure or even const (see gcc doc as
>>> to caveats with passing pointers to const functions). Cc-ing Nicola for
>>> possible clarification or correction.
>>>
>>
>> Maybe we should declare the function __pure. Having checked the gcc doc:
>> ``
>> a function that has pointer arguments must not be declared const
>> ``
>> Otherwise we could store the "c->x86" value to avoid using the pointer.
> 
> Either way could work. ECLAIR will automatically pick up 
> __attribute__((pure)) or __attribute__((const)) from the declaration. 
> Maybe it could be const, as from a cursory look I don't think the gcc 
> restriction on pointer arguments applies, as the pointee is not modified 
> between successive calls, but I might be mistaken.

Indeed this matches my reading of it. Yet things are somewhat delicate here,
so I like to always leave room for being proven wrong.
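
For illustration only, a minimal sketch of the "store the family value"
variant discussed above. It is not the code from the patch: the name
parse_freq_mhz() and the zero-divisor guard are assumptions made for this
example. Passing the family by value means no pointer is involved, so the
gcc caveat about const functions and pointer arguments does not arise.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): same three formulas as the
 * quoted amd_parse_freq(), but keyed on a family number passed by value.
 * __attribute__((const)) is a GCC/Clang extension.
 */
__attribute__((const))
static uint64_t parse_freq_mhz(unsigned int family, uint64_t value)
{
    if ( family < 0x17 )
        return (((value & 0x3f) + 0x10) * 100) >> ((value >> 6) & 7);

    if ( family <= 0x19 )
    {
        uint64_t did = (value >> 8) & 0x3f;

        /* Guard added for this sketch; the quoted code divides directly. */
        return did ? ((value & 0xff) * 25 * 8) / did : 0;
    }

    /* Family 1Ah layout: 12-bit field in units of 5 MHz. */
    return (value & 0xfff) * 5;
}
```

Writing the branches as early returns also avoids the "else" after
"return" that was commented on in the review.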

Jan


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 03/15] xen/cpufreq: refactor cmdline "cpufreq=xxx"
  2025-03-26  7:20     ` Penny, Zheng
@ 2025-03-26 10:43       ` Jan Beulich
  2025-04-01  5:44         ` Penny, Zheng
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2025-03-26 10:43 UTC (permalink / raw)
  To: Penny, Zheng
  Cc: Huang, Ray, Andrew Cooper, Anthony PERARD, Orzel, Michal,
	Julien Grall, Roger Pau Monné, Stefano Stabellini,
	xen-devel@lists.xenproject.org

On 26.03.2025 08:20, Penny, Zheng wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Monday, March 24, 2025 11:01 PM
>>
>> On 06.03.2025 09:39, Penny Zheng wrote:
>>> --- a/docs/misc/xen-command-line.pandoc
>>> +++ b/docs/misc/xen-command-line.pandoc
>>> @@ -535,7 +535,8 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
>>>    processor to autonomously force physical package components into idle state.
>>>    The default is enabled, but the option only applies when `hwp` is enabled.
>>>
>>> -There is also support for `;`-separated fallback options:
>>> +User could use `;`-separated options to support universal options
>>> +which they would like to try on any agnostic platform, *but* under
>>> +priority order, like
>>>  `cpufreq=hwp;xen,verbose`.  This first tries `hwp` and falls back to
>>> `xen` if  unavailable.  Note: The `verbose` suboption is handled
>>> globally.  Setting it  for either the primary or fallback option
>>> applies to both irrespective of where
>>
>> What does "support" here mean? I fear I can't even suggest what else to use, as I
>> don't follow what additional information you mean to add here. Is a change here
>> really needed?
> 
> There are two changes I'd like to address:
> 1) ";" is no longer meant only for fallback options. As we discussed before, we would
> like to support something like "cpufreq=hwp;amd-cppc;xen" so that users can list all the universal options
> they would like to try.

Why would the meaning of ; change? There's no difference between having a single
fallback option from hwp, or two of them from amd-cppc.

> 2) The options must be in *priority* order. In cpufreq_driver_init(), we use a loop to decide which driver to
> try first. If the user specifies "cpufreq=xen;amd-cppc", legacy P-state is placed before amd-cppc in cpufreq_xen_opts[];
> the loop then tries to register legacy P-state first and, once that succeeds, never tries to register amd-cppc at all.

This in-order aspect also doesn't change.

Overall I fear I don't feel my question was answered.
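
To make the in-order behaviour concrete, here is a minimal sketch of
"try each listed option in order, stop at the first that registers". It
is a standalone model, not the cpufreq_driver_init() code: the enum
names, try_in_order() and register_on_amd() are all hypothetical.

```c
#include <stddef.h>

enum cpufreq_opt { OPT_none, OPT_xen, OPT_hwp, OPT_amd_cppc };

/* Hypothetical stand-in for a driver registration hook: 0 on success. */
typedef int (*register_fn)(enum cpufreq_opt opt);

/* Walk the options in command-line order; the first success wins. */
static enum cpufreq_opt try_in_order(const enum cpufreq_opt *opts,
                                     size_t cnt, register_fn reg)
{
    for ( size_t i = 0; i < cnt; i++ )
        if ( reg(opts[i]) == 0 )
            return opts[i];

    return OPT_none;
}

/* Models an AMD system: hwp cannot register, amd-cppc and xen can. */
static int register_on_amd(enum cpufreq_opt opt)
{
    return (opt == OPT_amd_cppc || opt == OPT_xen) ? 0 : -1;
}
```

With "hwp;amd-cppc;xen" this model ends up on amd-cppc, while
"xen;amd-cppc" stops at xen, whether one or two fallbacks are listed.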

>>> --- a/xen/drivers/cpufreq/cpufreq.c
>>> +++ b/xen/drivers/cpufreq/cpufreq.c
>>> +    if ( arg[0] && arg[1] )
>>> +        ret = cpufreq_cmdline_parse(arg + 1, end);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static int __init cpufreq_cmdline_parse_hwp(const char *arg, const char *end)
>>> +{
>>> +    int ret = 0;
>>> +
>>> +    xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
>>> +    cpufreq_controller = FREQCTL_xen;
>>> +    cpufreq_xen_opts[cpufreq_xen_cnt++] = CPUFREQ_hwp;
>>> +    if ( arg[0] && arg[1] )
>>> +        ret = hwp_cmdline_parse(arg + 1, end);
>>> +
>>> +    return ret;
>>> +}
>>
>> For both of the helpers may I suggest s/parse/process/ or some such ("handle"
>> might be another possible term to use), as themselves they don't do any parsing?
>>
> 
> Maybe I mis-understood the previous comment you said
> ```
>         >          else if ( IS_ENABLED(CONFIG_INTEL) && choice < 0 &&
>         > ```
> 
>         For the rest of this, I guess I'd prefer to see this in context. Also with
>         regard to the helper function's name.
> ```
> I thought you suggested introducing a helper function to wrap the conditional code...
> Or maybe you were suggesting something like:
> ```
> #ifdef CONFIG_INTEL
> else if ( choice < 0 && !cmdline_strcmp(str, "hwp") )
> {
>     xen_processor_pmbits |= XEN_PROCES
>     ...
> }
> #endif
> ```

Was this reply of yours misplaced? It doesn't fit with the part of my reply in
context above. Or maybe I'm not understanding what you mean to say.

>> In the end I'm also not entirely convinced that we need these two almost identical
>> helpers (with a 3rd likely appearing in a later patch).

Instead it feels as if this response of yours was to this part of my comment.
Indeed iirc I was suggesting to introduce a helper function. Note, however, the
singular here as well as in your response above.

Jan


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 05/15] xen/x86: introduce "cpufreq=amd-cppc" xen cmdline
  2025-03-26  8:35     ` Penny, Zheng
@ 2025-03-26 10:55       ` Jan Beulich
  2025-03-27  3:12         ` Penny, Zheng
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2025-03-26 10:55 UTC (permalink / raw)
  To: Penny, Zheng
  Cc: Huang, Ray, Andrew Cooper, Anthony PERARD, Orzel, Michal,
	Julien Grall, Roger Pau Monné, Stefano Stabellini,
	xen-devel@lists.xenproject.org

On 26.03.2025 09:35, Penny, Zheng wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Monday, March 24, 2025 11:26 PM
>>
>> On 06.03.2025 09:39, Penny Zheng wrote:
>>> @@ -514,5 +515,14 @@ acpi_cpufreq_driver = {
>>>
>>>  int __init acpi_cpufreq_register(void)
>>>  {
>>> -    return cpufreq_register_driver(&acpi_cpufreq_driver);
>>> +    int ret;
>>> +
>>> +    ret = cpufreq_register_driver(&acpi_cpufreq_driver);
>>> +    if ( ret )
>>> +        return ret;
>>> +
>>> +    if ( IS_ENABLED(CONFIG_AMD) )
>>> +        xen_processor_pmbits &= ~XEN_PROCESSOR_PM_CPPC;
>>
>> What's the purpose of the if() here?
> 
> After cpufreq driver properly registered, I'd like XEN_PROCESSOR_PM_PX and XEN_PROCESSOR_PM_CPPC
> being exclusive value to represent the actual underlying registered driver.
> As users could define something like "cpufreq=amd-cppc,xen", which implies both XEN_PROCESSOR_PM_PX and XEN_PROCESSOR_PM_CPPC
> got set in parsing logic. With amd-cppc failing to register, we are falling back to legacy ones. Then XEN_PROCESSOR_PM_CPPC needs to clear.

Looks like you try to explain the &= when my question was about the if().
I understand the purpose of the &=. What I don't understand is why it needs
to be conditional.

>>> --- a/xen/include/acpi/cpufreq/cpufreq.h
>>> +++ b/xen/include/acpi/cpufreq/cpufreq.h
>>> @@ -28,6 +28,7 @@ enum cpufreq_xen_opt {
>>>      CPUFREQ_none,
>>>      CPUFREQ_xen,
>>>      CPUFREQ_hwp,
>>> +    CPUFREQ_amd_cppc,
>>>  };
>>>  extern enum cpufreq_xen_opt cpufreq_xen_opts[2];
>>
>> I'm pretty sure I pointed out before that this array needs to grow, now that you add a
>> 3rd kind of handling.
>>
> 
> Hmmm, but the CPUFREQ_hwp and CPUFREQ_amd_cppc are incompatible options.
> I thought cpufreq_xen_opts[] shall reflect available choices on their hardware.
> Even if users define "cpufreq=hwp;amd-cppc;xen", in Intel platform, cpufreq_xen_opts[] shall
> contain  CPUFREQ_hwp and CPUFREQ_xen, while in amd platform, cpufreq_xen_opts[] shall
> contain CPUFREQ_amd_cppc and CPUFREQ_xen

Maybe I misread the code, but the impression I got was that "cpufreq=hwp;amd-cppc;xen"
would populate 3 slots of the array (with one of "hwp" and "amd-cppc" necessarily not
working, leading to the next one to be tried).
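
A small standalone model of that concern, for illustration: parsing a
`;`-separated list stores one entry per recognised name, so three names
need three slots. parse_opts() and the enum below are hypothetical, and
suboptions such as `,verbose` are deliberately not handled.

```c
#include <stddef.h>
#include <string.h>

enum cpufreq_opt { OPT_none, OPT_xen, OPT_hwp, OPT_amd_cppc };

/*
 * Hypothetical parser: fills out[] in command-line order and returns the
 * number of entries, or -1 on an unknown name or when out[] is too small.
 */
static int parse_opts(const char *s, enum cpufreq_opt *out, size_t max)
{
    size_t cnt = 0;

    while ( *s )
    {
        size_t len = strcspn(s, ";");
        enum cpufreq_opt opt;

        if ( len == 3 && !strncmp(s, "xen", 3) )
            opt = OPT_xen;
        else if ( len == 3 && !strncmp(s, "hwp", 3) )
            opt = OPT_hwp;
        else if ( len == 8 && !strncmp(s, "amd-cppc", 8) )
            opt = OPT_amd_cppc;
        else
            return -1;

        if ( cnt == max )
            return -1; /* No free slot left: the array has to grow. */

        out[cnt++] = opt;
        s += len;
        if ( *s == ';' )
            s++;
    }

    return (int)cnt;
}
```

In this model "hwp;amd-cppc;xen" fits a 3-entry array but is rejected by
a 2-entry one, regardless of which vendor the code later runs on.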

Jan


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 14/15] xen/xenpm: Adapt cpu frequency monitor in xenpm
  2025-03-25 11:26   ` Jan Beulich
  2025-03-25 16:37     ` Jason Andryuk
@ 2025-03-26 15:45     ` Anthony PERARD
  1 sibling, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2025-03-26 15:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Penny Zheng, Jason Andryuk, ray.huang, Juergen Gross, xen-devel

On Tue, Mar 25, 2025 at 12:26:09PM +0100, Jan Beulich wrote:
> On 06.03.2025 09:39, Penny Zheng wrote:
> > Make `xenpm get-cpufreq-para/set-cpufreq-para` available in CPPC mode.
> > --- a/tools/libs/ctrl/xc_pm.c
> > +++ b/tools/libs/ctrl/xc_pm.c
> > @@ -214,13 +214,12 @@ int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
> > @@ -301,7 +302,8 @@ unlock_4:
> >      if ( user_para->gov_num )
> >          xc_hypercall_bounce_post(xch, scaling_available_governors);
> >  unlock_3:
> > -    xc_hypercall_bounce_post(xch, scaling_available_frequencies);
> > +    if ( user_para->freq_num )
> > +        xc_hypercall_bounce_post(xch, scaling_available_frequencies);
> >  unlock_2:
> >      xc_hypercall_bounce_post(xch, affected_cpus);
> >  unlock_1:
> 
> I'm also puzzled by the function's inconsistent return value - Anthony,
> can you explain / spot why things are the way they are?

Looks like 73367cf3b4b4 ("libxc: Fix xc_pm API calls to return negative
error and stash error in errno.") made some changes, and fixed some
return value to be like described in "xenctrl.h", but I guess failed to
also change the "ret = -errno".

Cheers,

-- 

Anthony Perard | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech


^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH v3 05/15] xen/x86: introduce "cpufreq=amd-cppc" xen cmdline
  2025-03-26 10:55       ` Jan Beulich
@ 2025-03-27  3:12         ` Penny, Zheng
  2025-03-27  7:48           ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Penny, Zheng @ 2025-03-27  3:12 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Huang, Ray, Andrew Cooper, Anthony PERARD, Orzel, Michal,
	Julien Grall, Roger Pau Monné, Stefano Stabellini,
	xen-devel@lists.xenproject.org

[Public]

Hi,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Wednesday, March 26, 2025 6:55 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Anthony PERARD <anthony.perard@vates.tech>;
> Orzel, Michal <Michal.Orzel@amd.com>; Julien Grall <julien@xen.org>; Roger Pau
> Monné <roger.pau@citrix.com>; Stefano Stabellini <sstabellini@kernel.org>; xen-
> devel@lists.xenproject.org
> Subject: Re: [PATCH v3 05/15] xen/x86: introduce "cpufreq=amd-cppc" xen
> cmdline
>
> On 26.03.2025 09:35, Penny, Zheng wrote:
> >> -----Original Message-----
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Monday, March 24, 2025 11:26 PM
> >>
> >> On 06.03.2025 09:39, Penny Zheng wrote:
> >>> @@ -514,5 +515,14 @@ acpi_cpufreq_driver = {
> >>>
> >>>  int __init acpi_cpufreq_register(void)
> >>>  {
> >>> -    return cpufreq_register_driver(&acpi_cpufreq_driver);
> >>> +    int ret;
> >>> +
> >>> +    ret = cpufreq_register_driver(&acpi_cpufreq_driver);
> >>> +    if ( ret )
> >>> +        return ret;
> >>> +
> >>> +    if ( IS_ENABLED(CONFIG_AMD) )
> >>> +        xen_processor_pmbits &= ~XEN_PROCESSOR_PM_CPPC;
> >>
> >> What's the purpose of the if() here?
> >
> > After cpufreq driver properly registered, I'd like XEN_PROCESSOR_PM_PX
> > and XEN_PROCESSOR_PM_CPPC being exclusive value to represent the
> actual underlying registered driver.
> > As users could define something like "cpufreq=amd-cppc,xen", which
> > implies both XEN_PROCESSOR_PM_PX and XEN_PROCESSOR_PM_CPPC
> got set in parsing logic. With amd-cppc failing to register, we are falling back to
> legacy ones. Then XEN_PROCESSOR_PM_CPPC needs to clear.
>
> Looks like you try to explain the &= when my question was about the if().
> I understand the purpose of the &=. What I don't understand is why it needs to be
> conditional.
>

Oh, I see your concern; I'll remove the if().

> >>> --- a/xen/include/acpi/cpufreq/cpufreq.h
> >>> +++ b/xen/include/acpi/cpufreq/cpufreq.h
> >>> @@ -28,6 +28,7 @@ enum cpufreq_xen_opt {
> >>>      CPUFREQ_none,
> >>>      CPUFREQ_xen,
> >>>      CPUFREQ_hwp,
> >>> +    CPUFREQ_amd_cppc,
> >>>  };
> >>>  extern enum cpufreq_xen_opt cpufreq_xen_opts[2];
> >>
> >> I'm pretty sure I pointed out before that this array needs to grow,
> >> now that you add a 3rd kind of handling.
> >>
> >
> > Hmmm, but the CPUFREQ_hwp and CPUFREQ_amd_cppc are incompatible
> options.
> > I thought cpufreq_xen_opts[] shall reflect available choices on their hardware.
> > Even if users define "cpufreq=hwp;amd-cppc;xen", in Intel platform,
> > cpufreq_xen_opts[] shall contain  CPUFREQ_hwp and CPUFREQ_xen, while
> > in amd platform, cpufreq_xen_opts[] shall contain CPUFREQ_amd_cppc and
> > CPUFREQ_xen
>
> Maybe I misread the code, but the impression I got was that "cpufreq=hwp;amd-
> cppc;xen"

My bad. On my platform, I haven't enabled CONFIG_INTEL. I previously assumed that
CONFIG_INTEL and CONFIG_AMD were incompatible options, which led me to believe that the following code
```
else if ( IS_ENABLED(CONFIG_INTEL) && choice < 0 &&
          !cmdline_strcmp(str, "hwp") )
{
    xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
    cpufreq_controller = FREQCTL_xen;
```
would not take effect on an AMD platform...
May I ask why they aren't made an incompatible pair? I assumed each wraps a vendor-specific feature, like vmx vs svm...

> would populate 3 slots of the array (with one of "hwp" and "amd-cppc" necessarily
> not working, leading to the next one to be tried).
>
> Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 05/15] xen/x86: introduce "cpufreq=amd-cppc" xen cmdline
  2025-03-27  3:12         ` Penny, Zheng
@ 2025-03-27  7:48           ` Jan Beulich
  2025-03-28  4:43             ` Penny, Zheng
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2025-03-27  7:48 UTC (permalink / raw)
  To: Penny, Zheng
  Cc: Huang, Ray, Andrew Cooper, Anthony PERARD, Orzel, Michal,
	Julien Grall, Roger Pau Monné, Stefano Stabellini,
	xen-devel@lists.xenproject.org

On 27.03.2025 04:12, Penny, Zheng wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Wednesday, March 26, 2025 6:55 PM
>>
>> On 26.03.2025 09:35, Penny, Zheng wrote:
>>>> -----Original Message-----
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>> Sent: Monday, March 24, 2025 11:26 PM
>>>>
>>>> On 06.03.2025 09:39, Penny Zheng wrote:
>>>>> --- a/xen/include/acpi/cpufreq/cpufreq.h
>>>>> +++ b/xen/include/acpi/cpufreq/cpufreq.h
>>>>> @@ -28,6 +28,7 @@ enum cpufreq_xen_opt {
>>>>>      CPUFREQ_none,
>>>>>      CPUFREQ_xen,
>>>>>      CPUFREQ_hwp,
>>>>> +    CPUFREQ_amd_cppc,
>>>>>  };
>>>>>  extern enum cpufreq_xen_opt cpufreq_xen_opts[2];
>>>>
>>>> I'm pretty sure I pointed out before that this array needs to grow,
>>>> now that you add a 3rd kind of handling.
>>>>
>>>
>>> Hmmm, but the CPUFREQ_hwp and CPUFREQ_amd_cppc are incompatible
>> options.
>>> I thought cpufreq_xen_opts[] shall reflect available choices on their hardware.
>>> Even if users define "cpufreq=hwp;amd-cppc;xen", in Intel platform,
>>> cpufreq_xen_opts[] shall contain  CPUFREQ_hwp and CPUFREQ_xen, while
>>> in amd platform, cpufreq_xen_opts[] shall contain CPUFREQ_amd_cppc and
>>> CPUFREQ_xen
>>
>> Maybe I misread the code, but the impression I got was that "cpufreq=hwp;amd-
>> cppc;xen"
> 
> My bad. In my platform, I haven't enabled the CONFIG_INTEL. I previously assumed that
> CONFIG_INTEL and CONFIG_AMD are incompatible options, which leads to the following code
> ```
> else if ( IS_ENABLED(CONFIG_INTEL) && choice < 0 &&
>           !cmdline_strcmp(str, "hwp") )
> {
>     xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
>     cpufreq_controller = FREQCTL_xen;
> ```
> shall not be working in AMD platform...
> May I ask why not make them incompatible pair? I assumed it each wraps vendor-specific feature, like vmx vs svm...

I'm sorry to say this, but that seems like a pretty odd question to ask. Distros
quite clearly want to build one single hypervisor which can be used on both
Intel and AMD hardware. CONFIG_* are build-time constants after all, not runtime
values. We use them in if() where possible (instead of in #if / #ifdef) simply
to expose as much code as possible to at least syntax and alike checking by the
compiler, irrespective of configuration used by a particular individual. This
way we limit the risk of bit-rotting and unexpected build failures at least some.
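
As a rough illustration of that point, a simplified standalone sketch.
The IS_ENABLED() below is a deliberately naive stand-in that needs each
option defined as 0 or 1; the real macro also copes with undefined
symbols. pick_driver() and the enum are hypothetical names.

```c
#include <string.h>

/* Build-time configuration of this example: an AMD-only build. */
#define CONFIG_INTEL 0
#define CONFIG_AMD   1

/* Naive stand-in for the real IS_ENABLED(); assumes 0/1 definitions. */
#define IS_ENABLED(opt) (opt)

enum driver { DRIVER_NONE, DRIVER_XEN, DRIVER_HWP, DRIVER_AMD_CPPC };

static enum driver pick_driver(const char *str)
{
    /*
     * Both branches are always parsed and type-checked; the compiler
     * then drops the one whose CONFIG_* constant is 0. With #ifdef the
     * disabled branch would not be compiled at all.
     */
    if ( IS_ENABLED(CONFIG_INTEL) && !strcmp(str, "hwp") )
        return DRIVER_HWP;

    if ( IS_ENABLED(CONFIG_AMD) && !strcmp(str, "amd-cppc") )
        return DRIVER_AMD_CPPC;

    if ( !strcmp(str, "xen") )
        return DRIVER_XEN;

    return DRIVER_NONE;
}
```

In this AMD-only configuration "hwp" is simply not recognised, whereas a
build with both options set to 1 would accept both names and leave the
choice to runtime detection.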

Jan


^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH v3 08/15] xen/amd: export processor max frequency value
  2025-03-24 15:52   ` Jan Beulich
@ 2025-03-27  8:38     ` Penny, Zheng
  0 siblings, 0 replies; 60+ messages in thread
From: Penny, Zheng @ 2025-03-27  8:38 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Huang, Ray, Andrew Cooper, Roger Pau Monné,
	xen-devel@lists.xenproject.org

[Public]

Hi,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Monday, March 24, 2025 11:52 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Roger Pau Monné <roger.pau@citrix.com>; xen-
> devel@lists.xenproject.org
> Subject: Re: [PATCH v3 08/15] xen/amd: export processor max frequency value
>
> On 06.03.2025 09:39, Penny Zheng wrote:
> > --- a/xen/arch/x86/cpu/amd.c
> > +++ b/xen/arch/x86/cpu/amd.c
> > @@ -56,6 +56,8 @@ bool __initdata amd_virt_spec_ctrl;
> >
> >  static bool __read_mostly fam17_c6_disabled;
> >
> > +DEFINE_PER_CPU_READ_MOSTLY(uint64_t, amd_max_freq_mhz);
> > +
> >  static inline int rdmsr_amd_safe(unsigned int msr, unsigned int *lo,
> >                              unsigned int *hi)
> >  {
> > @@ -681,9 +683,15 @@ void amd_log_freq(const struct cpuinfo_x86 *c)
> >             printk("CPU%u: %lu ... %lu MHz\n",
> >                    smp_processor_id(),
> >                    amd_parse_freq(c, lo), amd_parse_freq(c, hi));
> > -   else
> > +   else {
> >             printk("CPU%u: %lu MHz\n", smp_processor_id(),
> >                    amd_parse_freq(c, lo));
> > +           return;
> > +   }
> > +
> > +   /* Store max frequency for amd-cppc cpufreq driver */
> > +   if (hi >> 63)
> > +           this_cpu(amd_max_freq_mhz) = amd_parse_freq(c, hi);
> >  }
>
> As before - typically only the BSP will make it here, due to the conditional at the top
> of the function. IOW you'll observe zeros in the per-CPU data for all other CPUs.
>

I'll extract the frequency processing logic into a new helper, maybe amd_process_freq().

> > --- a/xen/arch/x86/include/asm/amd.h
> > +++ b/xen/arch/x86/include/asm/amd.h
> > @@ -174,4 +174,5 @@ bool amd_setup_legacy_ssbd(void);
> >  void amd_set_legacy_ssbd(bool enable);
> >  void amd_set_cpuid_user_dis(bool enable);
> >
> > +DECLARE_PER_CPU(uint64_t, amd_max_freq_mhz);
> >  #endif /* __AMD_H__ */
>
> I'm also pretty sure that I did ask before to maintain a blank line ahead of the
> #endif. Please may I ask that you thoroughly address earlier review comments,
> before submitting a new version?
>

Sorry, I'll be more careful.

> Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH v3 12/15] xen/x86: implement EPP support for the amd-cppc driver in active mode
  2025-03-25 10:48   ` Jan Beulich
@ 2025-03-28  4:07     ` Penny, Zheng
  2025-03-28  7:18       ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Penny, Zheng @ 2025-03-28  4:07 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Huang, Ray, Andrew Cooper, Anthony PERARD, Orzel, Michal,
	Julien Grall, Roger Pau Monné, Stefano Stabellini,
	xen-devel@lists.xenproject.org

[Public]

Hi,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, March 25, 2025 6:49 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Anthony PERARD <anthony.perard@vates.tech>;
> Orzel, Michal <Michal.Orzel@amd.com>; Julien Grall <julien@xen.org>; Roger
> Pau Monné <roger.pau@citrix.com>; Stefano Stabellini <sstabellini@kernel.org>;
> xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v3 12/15] xen/x86: implement EPP support for the amd-cppc
> driver in active mode
>
> On 06.03.2025 09:39, Penny Zheng wrote:
> > amd-cppc has 2 operation modes: autonomous (active) mode,
> > non-autonomous (passive) mode.
> > In active mode, platform ignores the requests made in the Desired
> > Performance Target register and takes into account only the values set
> > to the minimum, maximum and energy performance preference(EPP)
> > registers.
> > The EPP is used in the CCLK DPM controller to drive the frequency that
> > a core is going to operate during short periods of activity.
> > The SOC EPP targets are configured on a scale from 0 to 255 where 0
> > represents maximum performance and 255 represents maximum efficiency.
>
> So this is the other way around from "perf" values, where aiui 0xff is "highest"?
>

Yes, it is not the perf value. It is an arbitrary value on a scale from 0 to 255

> > @@ -261,7 +276,20 @@ static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
> >          return res;
> >
> >      return amd_cppc_write_request(policy->cpu, data->caps.lowest_nonlinear_perf,
> > -                                  des_perf, data->caps.highest_perf);
> > +                                  des_perf, data->caps.highest_perf,
> > +                                  /* Pre-defined BIOS value for passive mode */
> > +                                  per_cpu(epp_init, policy->cpu));
> >  }
> > +
> > +static int read_epp_init(void)
> > +{
> > +    uint64_t val;
> > +
> > +    if ( rdmsr_safe(MSR_AMD_CPPC_REQ, val) )
> > +        return -EINVAL;
>
> I'm unconvinced of using rdmsr_safe() everywhere (i.e. this also goes for earlier
> patches). Unless you can give a halfway reasonable scenario under which by the
> time we get here there's still a chance that the MSR isn't implemented in the next
> lower layer (hardware or another hypervisor, just to explain what's meant, without
> me assuming that the driver should come into play in the first place when we run
> virtualized ourselves).
>

Correct me if I misunderstand: the concern is that the driver may not always
have the privilege to directly access the MSR in all scenarios, so rdmsr_safe() with exception
handling isn't always suitable. Then maybe I should switch them all to rdmsrl()?

> Furthermore you call this function unconditionally, i.e. if there was a chance for the
> MSR read to fail, CPU init would needlessly fail when in passive mode.
>

The reason I also run read_epp_init() in passive mode is to avoid writing a zero EPP value
to MSR_AMD_CPPC_REQ in passive mode; I want to keep the pre-defined BIOS value there.
If we wrap read_epp_init() in an active-mode check, maybe we should add an extra read before writing the request register MSR_AMD_CPPC_REQ,
introducing MSR_AMD_CPPC_EPP_MASK to preserve the original EPP value in passive mode. Or is there a better suggestion?
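
A minimal sketch of that read-modify-write idea, for illustration only.
The field layout used here (max_perf in bits 7:0, min_perf in 15:8,
des_perf in 23:16, EPP in 31:24) is an assumption of this example and
should be checked against the processor documentation; CPPC_EPP_MASK and
build_req_keep_epp() are hypothetical names.

```c
#include <stdint.h>

/* Assumed position of the EPP field within the request value. */
#define CPPC_EPP_MASK (UINT64_C(0xff) << 24)

/*
 * Build a new request value from the current register contents, keeping
 * whatever EPP value is already there and replacing the perf fields.
 */
static uint64_t build_req_keep_epp(uint64_t cur, uint8_t min_perf,
                                   uint8_t des_perf, uint8_t max_perf)
{
    uint64_t req = cur & CPPC_EPP_MASK;

    req |= (uint64_t)max_perf;
    req |= (uint64_t)min_perf << 8;
    req |= (uint64_t)des_perf << 16;

    return req;
}
```

The cost of this variant is one extra MSR read per request, compared
with caching the initial EPP value in per-CPU data.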

> > +    {
> > +        /* Force the epp value to be zero for performance policy */
> > +        epp = CPPC_ENERGY_PERF_MAX_PERFORMANCE;
> > +        min_perf = max_perf;
> > +    }
> > +    else if ( policy->policy == CPUFREQ_POLICY_POWERSAVE )
> > +        /* Force the epp value to be 0xff for powersave policy */
> > +        /*
> > +         * If set max_perf = min_perf = lowest_perf, we are putting
> > +         * cpu cores in idle.
> > +         */
>
> Nit: Such two successive comments want combining. (Same near the top of the
> function, as I notice only now.)
>
> Furthermore I'm in trouble with interpreting this comment: To me "lowest"
> doesn't mean "doing nothing" but "doing things as efficiently in terms of power use
> as possible". IOW that's not idle. Yet the comment reads as if it was meant to be an
> explanation of why we can't set max_perf from min_perf here. That is, not matter
> what's meant to be said, I think this needs re- wording (and possibly using
> subjunctive mood).
>

How about:
The lowest non-linear perf is equivalent to the P2 frequency. Reducing performance below this
point does not lead to total energy savings for a given computation (although it reduces momentary power).
So we do not suggest setting max_perf smaller than the lowest non-linear perf, or even the lowest perf.
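
For illustration, a standalone sketch of how the two policies could map
onto EPP and the perf limits. Only CPPC_ENERGY_PERF_MAX_PERFORMANCE
appears in the patch; CPPC_ENERGY_PERF_MAX_POWERSAVE, struct perf_req
and apply_policy() are names assumed for this example, and leaving
min/max untouched for powersave is one possible reading of the comment.

```c
#include <stdint.h>

#define CPPC_ENERGY_PERF_MAX_PERFORMANCE 0x00
#define CPPC_ENERGY_PERF_MAX_POWERSAVE   0xff /* hypothetical name */

enum policy { POLICY_PERFORMANCE, POLICY_POWERSAVE, POLICY_OTHER };

struct perf_req {
    uint8_t min_perf;
    uint8_t max_perf;
    uint8_t epp;
};

static struct perf_req apply_policy(enum policy p, struct perf_req req)
{
    switch ( p )
    {
    case POLICY_PERFORMANCE:
        /* Bias fully toward performance and pin min to max. */
        req.epp = CPPC_ENERGY_PERF_MAX_PERFORMANCE;
        req.min_perf = req.max_perf;
        break;

    case POLICY_POWERSAVE:
        /* Bias fully toward efficiency, but do not lower max_perf. */
        req.epp = CPPC_ENERGY_PERF_MAX_POWERSAVE;
        break;

    default:
        break;
    }

    return req;
}
```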

> Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH v3 05/15] xen/x86: introduce "cpufreq=amd-cppc" xen cmdline
  2025-03-27  7:48           ` Jan Beulich
@ 2025-03-28  4:43             ` Penny, Zheng
  0 siblings, 0 replies; 60+ messages in thread
From: Penny, Zheng @ 2025-03-28  4:43 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Huang, Ray, Andrew Cooper, Anthony PERARD, Orzel, Michal,
	Julien Grall, Roger Pau Monné, Stefano Stabellini,
	xen-devel@lists.xenproject.org

[Public]

Hi,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Thursday, March 27, 2025 3:48 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Anthony PERARD <anthony.perard@vates.tech>;
> Orzel, Michal <Michal.Orzel@amd.com>; Julien Grall <julien@xen.org>; Roger
> Pau Monné <roger.pau@citrix.com>; Stefano Stabellini <sstabellini@kernel.org>;
> xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v3 05/15] xen/x86: introduce "cpufreq=amd-cppc" xen
> cmdline
>
> On 27.03.2025 04:12, Penny, Zheng wrote:
> >> -----Original Message-----
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Wednesday, March 26, 2025 6:55 PM
> >>
> >> On 26.03.2025 09:35, Penny, Zheng wrote:
> >>>> -----Original Message-----
> >>>> From: Jan Beulich <jbeulich@suse.com>
> >>>> Sent: Monday, March 24, 2025 11:26 PM
> >>>>
> >>>> On 06.03.2025 09:39, Penny Zheng wrote:
> >>>>> --- a/xen/include/acpi/cpufreq/cpufreq.h
> >>>>> +++ b/xen/include/acpi/cpufreq/cpufreq.h
> >>>>> @@ -28,6 +28,7 @@ enum cpufreq_xen_opt {
> >>>>>      CPUFREQ_none,
> >>>>>      CPUFREQ_xen,
> >>>>>      CPUFREQ_hwp,
> >>>>> +    CPUFREQ_amd_cppc,
> >>>>>  };
> >>>>>  extern enum cpufreq_xen_opt cpufreq_xen_opts[2];
> >>>>
> >>>> I'm pretty sure I pointed out before that this array needs to grow,
> >>>> now that you add a 3rd kind of handling.
> >>>>
> >>>
> >>> Hmmm, but the CPUFREQ_hwp and CPUFREQ_amd_cppc are incompatible
> >> options.
> >>> I thought cpufreq_xen_opts[] shall reflect available choices on their hardware.
> >>> Even if users define "cpufreq=hwp;amd-cppc;xen", in Intel platform,
> >>> cpufreq_xen_opts[] shall contain  CPUFREQ_hwp and CPUFREQ_xen, while
> >>> in amd platform, cpufreq_xen_opts[] shall contain CPUFREQ_amd_cppc
> >>> and CPUFREQ_xen
> >>
> >> Maybe I misread the code, but the impression I got was that
> >> "cpufreq=hwp;amd-cppc;xen"
> >
> > My bad. In my platform, I haven't enabled the CONFIG_INTEL. I
> > previously assumed that CONFIG_INTEL and CONFIG_AMD are incompatible
> > options, which leads to the following code
> > ```
> > else if ( IS_ENABLED(CONFIG_INTEL) && choice < 0 &&
> >           !cmdline_strcmp(str, "hwp") )
> > {
> >     xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
> >     cpufreq_controller = FREQCTL_xen;
> > ```
> > shall not be working in AMD platform...
> > May I ask why not make them incompatible pair? I assumed it each wraps
> vendor-specific feature, like vmx vs svm...
>
> I'm sorry to say this, but that seems like a pretty odd question to ask. Distros quite
> clearly want to build one single hypervisor which can be used on both Intel and
> AMD hardware. CONFIG_* are build-time constants after all, not runtime values.
> We use them in if() where possible (instead of in #if / #ifdef) simply to expose as
> much code as possible to at least syntax and alike checking by the compiler,
> irrespective of configuration used by a particular individual. This way we limit the
> risk of bit-rotting and unexpected build failures at least some.
>

Thanks for the detailed explanation, understood!

> Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 12/15] xen/x86: implement EPP support for the amd-cppc driver in active mode
  2025-03-28  4:07     ` Penny, Zheng
@ 2025-03-28  7:18       ` Jan Beulich
  2025-04-08 10:32         ` Penny, Zheng
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2025-03-28  7:18 UTC (permalink / raw)
  To: Penny, Zheng
  Cc: Huang, Ray, Andrew Cooper, Anthony PERARD, Orzel, Michal,
	Julien Grall, Roger Pau Monné, Stefano Stabellini,
	xen-devel@lists.xenproject.org

On 28.03.2025 05:07, Penny, Zheng wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Tuesday, March 25, 2025 6:49 PM
>>
>> On 06.03.2025 09:39, Penny Zheng wrote:
>>> @@ -261,7 +276,20 @@ static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
>>>          return res;
>>>
>>>      return amd_cppc_write_request(policy->cpu, data->caps.lowest_nonlinear_perf,
>>> -                                  des_perf, data->caps.highest_perf);
>>> +                                  des_perf, data->caps.highest_perf,
>>> +                                  /* Pre-defined BIOS value for passive mode */
>>> +                                  per_cpu(epp_init, policy->cpu));
>>>  }
>>> +
>>> +static int read_epp_init(void)
>>> +{
>>> +    uint64_t val;
>>> +
>>> +    if ( rdmsr_safe(MSR_AMD_CPPC_REQ, val) )
>>> +        return -EINVAL;
>>
>> I'm unconvinced of using rdmsr_safe() everywhere (i.e. this also goes for earlier
>> patches). Unless you can give a halfway reasonable scenario under which by the
>> time we get here there's still a chance that the MSR isn't implemented in the next
>> lower layer (hardware or another hypervisor, just to explain what's meant, without
>> me assuming that the driver should come into play in the first place when we run
>> virtualized ourselves).
>>
> 
> Correct me if I understand wrongly, we are concerning that the driver may not always
> have the privilege to directly access the MSR in all scenarios, so rdmsr_safe with exception
> handling isn't always suitable. Then maybe I shall switch them all into rdmsrl() ?

There's no privilege question here - we're running at the highest possible privilege
level. The only question in MSR access can concern the existence of these MSRs (on
bare hardware) or improper emulation of MSRs by an underlying hypervisor. The latter
case I think we can pretty much exclude for a driver like this one - the driver
simply has no (real) use when running virtualized. Which leaves errata on hardware.
Those would better be dealt with by checking once up front (and then disabling the
driver if need be). IOW except for perhaps a single probing access early in driver
init, I think these better would all be plain accesses. And even such an early
probing access would likely only need switching to when a relevant erratum becomes
known.
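
To illustrate the "probe once up front, then plain accesses" idea, here is a
minimal stand-alone sketch. Everything in it is hypothetical: msr_read_safe()
and msr_read() merely stand in for the roles of a fault-tolerant and a plain
MSR read, and the "hardware" is two static variables. It is not taken from the
series.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for hardware: does the MSR exist, and its content. */
static bool msr_implemented = true;
static uint64_t msr_content = 0xff;

/* Stand-in for a fault-tolerant read (the role rdmsr_safe() plays). */
static int msr_read_safe(uint64_t *val)
{
    if ( !msr_implemented )
        return -EINVAL;

    *val = msr_content;
    return 0;
}

/* Stand-in for a plain read; only meant to be used after a good probe. */
static uint64_t msr_read(void)
{
    return msr_content;
}

static bool driver_enabled;

/* The single probing access: disable the driver if the MSR is absent. */
static int driver_probe(void)
{
    uint64_t val;

    driver_enabled = !msr_read_safe(&val);

    return driver_enabled ? 0 : -ENODEV;
}

/* Every later access is a plain one and therefore cannot fail. */
static uint64_t driver_read_request(void)
{
    return msr_read();
}
```

With this shape, only driver_probe() has an error path; callers of
driver_read_request() need no error handling at all.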

>> Furthermore you call this function unconditionally, i.e. if there was a chance for the
>> MSR read to fail, CPU init would needlessly fail when in passive mode.
>>
> 
> The reason why I also run read_epp_init() for passive mode is to avoid setting epp with zero value
> for MSR_AMD_CPPC_REQ in passive mode. I want to give it pre-defined BIOS value in passive mode.
> If we wrap read_epp_init() with active mode check, maybe we shall add extra read before setting request register MSR_AMD_CPPC_REQ,
> introducing MSR_AMD_CPPC_EPP_MASK to reserve original value for epp in passive mode, or any better suggestion?

Well, not using rdmsr_safe() here would make the function impossible to fail, and
hence the question itself would be moot. Otherwise my suggestion would be to ignore
the error (perhaps associated with a warning) in passive mode.
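
The second option (ignore the error in passive mode, with a warning) could look
roughly like the sketch below. The names read_epp() and cpu_init(), the 0x80
"firmware default" and the choice to leave the caller's value untouched on
failure are all assumptions made for illustration only.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Test knob: pretend the EPP read can fail. */
static bool epp_read_fails;

/* Hypothetical read of the firmware-provided EPP value. */
static int read_epp(uint8_t *epp)
{
    if ( epp_read_fails )
        return -EINVAL;

    *epp = 0x80; /* made-up firmware default */
    return 0;
}

/*
 * Active mode needs the value, so an error is propagated there. Passive
 * mode only warns and keeps whatever the caller had in *epp_init.
 */
static int cpu_init(bool active_mode, uint8_t *epp_init)
{
    int rc = read_epp(epp_init);

    if ( !rc || active_mode )
        return rc;

    fprintf(stderr, "EPP read failed (%d), keeping default in passive mode\n",
            rc);

    return 0;
}
```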

>>> +    {
>>> +        /* Force the epp value to be zero for performance policy */
>>> +        epp = CPPC_ENERGY_PERF_MAX_PERFORMANCE;
>>> +        min_perf = max_perf;
>>> +    }
>>> +    else if ( policy->policy == CPUFREQ_POLICY_POWERSAVE )
>>> +        /* Force the epp value to be 0xff for powersave policy */
>>> +        /*
>>> +         * If set max_perf = min_perf = lowest_perf, we are putting
>>> +         * cpu cores in idle.
>>> +         */
>>
>> Nit: Such two successive comments want combining. (Same near the top of the
>> function, as I notice only now.)
>>
>> Furthermore I'm in trouble with interpreting this comment: To me "lowest"
>> doesn't mean "doing nothing" but "doing things as efficiently in terms of power use
>> as possible". IOW that's not idle. Yet the comment reads as if it was meant to be an
>> explanation of why we can't set max_perf from min_perf here. That is, no matter
>> what's meant to be said, I think this needs re-wording (and possibly using
>> subjunctive mood).
> 
> How about:
> The lowest non-linear perf is equivalent to the P2 frequency. Reducing performance below this
> point does not lead to total energy savings for a given computation (although it reduces momentary power).
> So we are not suggesting to set max_perf smaller than lowest non-linear perf, or even the lowest perf.

In an abstract way I think I can follow this. In the context of the code being
commented, however, I'm afraid I still can't make sense of it. Main point being
that the code commented doesn't use any of the *_perf values. It only sets the
"epp" local variable. Maybe the point of the comment is to explain why none of
the *_perf are used here, but I can't read this out of either of the proposed
texts.

Jan


^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH v3 02/15] xen/x86: introduce new sub-hypercall to propagate CPPC data
  2025-03-25  7:53       ` Jan Beulich
@ 2025-03-28  8:27         ` Penny, Zheng
  2025-03-28  8:36           ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Penny, Zheng @ 2025-03-28  8:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Huang, Ray, Andrew Cooper, Roger Pau Monné, Anthony PERARD,
	Orzel, Michal, Julien Grall, Stefano Stabellini,
	xen-devel@lists.xenproject.org

[Public]

Hi,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, March 25, 2025 3:54 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Roger Pau Monné <roger.pau@citrix.com>;
> Anthony PERARD <anthony.perard@vates.tech>; Orzel, Michal
> <Michal.Orzel@amd.com>; Julien Grall <julien@xen.org>; Stefano Stabellini
> <sstabellini@kernel.org>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v3 02/15] xen/x86: introduce new sub-hypercall to propagate
> CPPC data
>
> On 25.03.2025 05:12, Penny, Zheng wrote:
> >> -----Original Message-----
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Monday, March 24, 2025 10:28 PM
> >>
> >> On 06.03.2025 09:39, Penny Zheng wrote:
> >>> +    pm_info = processor_pminfo[cpuid];
> >>> +    /* Must already allocated in set_psd_pminfo */
> >>> +    if ( !pm_info )
> >>> +    {
> >>> +        ret = -EINVAL;
> >>> +        goto out;
> >>> +    }
> >>> +    pm_info->cppc_data = *cppc_data;
> >>> +
> >>> +    if ( cpufreq_verbose )
> >>> +        print_CPPC(&pm_info->cppc_data);
> >>> +
> >>> +    pm_info->init = XEN_CPPC_INIT;
> >>
> >> That is - whichever Dom0 invoked last will have data recorded, and
> >> the other effectively is discarded? I think a warning (perhaps a
> >> one-time one) is minimally needed to diagnose the case where one type of
> >> data replaces the other.
> >>
> >
> > In last v2 discussion, we are discussing that either set_px_pminfo or
> > set_cppc_pminfo shall be invoked, which means either PX data is recorded, or
> > CPPC data is recorded.
> > Current logic is that, cpufreq cmdline logic will set the
> > XEN_PROCESSOR_PM_PX/CPPC flag to reflect user preference, if user
> > defines the fallback option, like "cpufreq=amd-cppc,xen", we will have both
> > XEN_PROCESSOR_PM_PX | XEN_PROCESSOR_PM_CPPC set in the beginning.
> > Later in cpufreq driver register logic, as only one register could be
> > registered, if amd-cppc being registered successfully, it will clear the
> > XEN_PROCESSOR_PM_PX flag bit.
> > But if it fails to register, fallback scheme kicks off, we will try
> > the legacy P-states, in the mean time, clearing the
> > XEN_PROCESSOR_PM_CPPC.
> > We are trying to make XEN_PROCESSOR_PM_PX and XEN_PROCESSOR_PM_CPPC
> > exclusive values after driver registration, which will ensure us that
> > either set_px_pminfo or set_cppc_pminfo is taken in the runtime.
>
> Yet you realize that this implies Dom0 to know what configuration Xen uses, in
> order to know which data to upload. The best approach might be to have
> Dom0 upload all data it has, with us merely ignoring what we can't make use of.

Please correct me if I misunderstand you:
Right now, I let Dom0 upload all the data it has, and in Xen:
```
    case XEN_PM_CPPC:
        if ( !(xen_processor_pmbits & XEN_PROCESSOR_PM_CPPC) )
        {
            ret = -EOPNOTSUPP;
            break;
        }
        ret = set_cppc_pminfo(op->u.set_pminfo.id,
                              &op->u.set_pminfo.u.cppc_data);
        break;

    case XEN_PM_PX:
        if ( !(xen_processor_pmbits & XEN_PROCESSOR_PM_PX) )
        {
            ret = -EOPNOTSUPP;
            break;
        }
        ret = set_px_pminfo(op->u.set_pminfo.id, &op->u.set_pminfo.u.perf);
        break;
```
I relied on the XEN_PROCESSOR_PM_CPPC and XEN_PROCESSOR_PM_PX flags to choose which
info to record.
So firstly, we shall not return the -EOPNOTSUPP error there.

> The order of uploading (CPPC first or CPPC last) shouldn't matter. Then (and only
> then, and - ftaod - only when uploading of the "wrong" kind of data doesn't result in
> an error) things can go without warning.

Then in
```
    pm_info->init = XEN_CPPC_INIT;
    ret = cpufreq_cpu_init(cpuid);
```
We shall add a warning here, for when ret is non-zero, to make clear that there is no fallback scheme left to switch to at this point.

>
> Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 02/15] xen/x86: introduce new sub-hypercall to propagate CPPC data
  2025-03-28  8:27         ` Penny, Zheng
@ 2025-03-28  8:36           ` Jan Beulich
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Beulich @ 2025-03-28  8:36 UTC (permalink / raw)
  To: Penny, Zheng
  Cc: Huang, Ray, Andrew Cooper, Roger Pau Monné, Anthony PERARD,
	Orzel, Michal, Julien Grall, Stefano Stabellini,
	xen-devel@lists.xenproject.org

On 28.03.2025 09:27, Penny, Zheng wrote:
> [Public]
> 
> Hi,
> 
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Tuesday, March 25, 2025 3:54 PM
>> To: Penny, Zheng <penny.zheng@amd.com>
>> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
>> <andrew.cooper3@citrix.com>; Roger Pau Monné <roger.pau@citrix.com>;
>> Anthony PERARD <anthony.perard@vates.tech>; Orzel, Michal
>> <Michal.Orzel@amd.com>; Julien Grall <julien@xen.org>; Stefano Stabellini
>> <sstabellini@kernel.org>; xen-devel@lists.xenproject.org
>> Subject: Re: [PATCH v3 02/15] xen/x86: introduce new sub-hypercall to propagate
>> CPPC data
>>
>> On 25.03.2025 05:12, Penny, Zheng wrote:
>>>> -----Original Message-----
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>> Sent: Monday, March 24, 2025 10:28 PM
>>>>
>>>> On 06.03.2025 09:39, Penny Zheng wrote:
>>>>> +    pm_info = processor_pminfo[cpuid];
>>>>> +    /* Must already allocated in set_psd_pminfo */
>>>>> +    if ( !pm_info )
>>>>> +    {
>>>>> +        ret = -EINVAL;
>>>>> +        goto out;
>>>>> +    }
>>>>> +    pm_info->cppc_data = *cppc_data;
>>>>> +
>>>>> +    if ( cpufreq_verbose )
>>>>> +        print_CPPC(&pm_info->cppc_data);
>>>>> +
>>>>> +    pm_info->init = XEN_CPPC_INIT;
>>>>
>>>> That is - whichever Dom0 invoked last will have data recorded, and
>>>> the other effectively is discarded? I think a warning (perhaps a
>>>> one-time one) is minimally needed to diagnose the case where one type of
>>>> data replaces the other.
>>>>
>>>
>>> In last v2 discussion, we are discussing that either set_px_pminfo or
>>> set_cppc_pminfo shall be invoked, which means either PX data is recorded, or
>>> CPPC data is recorded.
>>> Current logic is that, cpufreq cmdline logic will set the
>>> XEN_PROCESSOR_PM_PX/CPPC flag to reflect user preference, if user
>>> defines the fallback option, like "cpufreq=amd-cppc,xen", we will have both
>>> XEN_PROCESSOR_PM_PX | XEN_PROCESSOR_PM_CPPC set in the beginning.
>>> Later in cpufreq driver register logic, as only one register could be
>>> registered, if amd-cppc being registered successfully, it will clear the
>>> XEN_PROCESSOR_PM_PX flag bit.
>>> But if it fails to register, fallback scheme kicks off, we will try
>>> the legacy P-states, in the mean time, clearing the
>>> XEN_PROCESSOR_PM_CPPC.
>>> We are trying to make XEN_PROCESSOR_PM_PX and XEN_PROCESSOR_PM_CPPC
>>> exclusive values after driver registration, which will ensure us that
>>> either set_px_pminfo or set_cppc_pminfo is taken in the runtime.
>>
>> Yet you realize that this implies Dom0 to know what configuration Xen uses, in
>> order to know which data to upload. The best approach might be to have
>> Dom0 upload all data it has, with us merely ignoring what we can't make use of.
> 
> PLZ correct me if I understand you wrongly:
> Right now, I was letting DOM0 upload all data it has, and in the Xen:
> ```
>     case XEN_PM_CPPC:
>         if ( !(xen_processor_pmbits & XEN_PROCESSOR_PM_CPPC) )
>         {
>             ret = -EOPNOTSUPP;
>             break;
>         }
>         ret = set_cppc_pminfo(op->u.set_pminfo.id,
>                               &op->u.set_pminfo.u.cppc_data);
>         break;
> 
>     case XEN_PM_PX:
>         if ( !(xen_processor_pmbits & XEN_PROCESSOR_PM_PX) )
>         {
>             ret = -EOPNOTSUPP;
>             break;
>         }
>         ret = set_px_pminfo(op->u.set_pminfo.id, &op->u.set_pminfo.u.perf);
>         break;
> ```
> I relied on the XEN_PROCESSOR_PM_CPPC and XEN_PROCESSOR_PM_PX flags to choose which
> info to record.
> So firstly, we shall not return the -EOPNOTSUPP error there.

Yes.

>> The order of uploading (CPPC first or CPPC last) shouldn't matter. Then (and only
>> then, and - ftaod - only when uploading of the "wrong" kind of data doesn't result in
>> an error) things can go without warning.
> 
> Then in
> ```
>     pm_info->init = XEN_CPPC_INIT;
>     ret = cpufreq_cpu_init(cpuid);
> ```
> We shall add a warning here, for when ret is non-zero, to make clear that there is no fallback scheme left to switch to at this point.

Maybe. In the earlier reply I said with certain conditions fulfilled a warning
may not be necessary. Yet perhaps initially having a warning there (maybe just
for debug builds) may make sense.
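
A small stand-alone sketch of the "upload everything, ignore what can't be
used" behaviour, with a debug-only diagnostic. The PM_PX / PM_CPPC bits, the
pmbits variable and set_pminfo() are simplified, hypothetical mirrors of the
names discussed above, not the real hypercall handler.

```c
#include <stdio.h>

/* Hypothetical mirrors of the two flag bits discussed in this thread. */
#define PM_PX   (1u << 0)
#define PM_CPPC (1u << 1)

enum pm_type { PM_TYPE_PX, PM_TYPE_CPPC };

/* What is left set after driver registration (here: CPPC won). */
static unsigned int pmbits = PM_CPPC;
/* Which kinds of data actually got recorded. */
static unsigned int recorded;

static int set_pminfo(enum pm_type type)
{
    unsigned int bit = (type == PM_TYPE_CPPC) ? PM_CPPC : PM_PX;

    if ( !(pmbits & bit) )
    {
#ifndef NDEBUG
        /* Debug-build-only note; the upload itself is not an error. */
        fprintf(stderr, "ignoring %s data: not used by the active driver\n",
                type == PM_TYPE_CPPC ? "CPPC" : "Px");
#endif
        return 0;
    }

    recorded |= bit;

    return 0;
}
```

Because the unused kind is dropped rather than rejected, the result is the
same whether Dom0 uploads CPPC data first or last.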

Jan


^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH v3 01/15] xen/cpufreq: introduces XEN_PM_PSD for solely delivery of _PSD
  2025-03-24 14:08   ` Jan Beulich
@ 2025-04-01  3:25     ` Penny, Zheng
  0 siblings, 0 replies; 60+ messages in thread
From: Penny, Zheng @ 2025-04-01  3:25 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Huang, Ray, Andrew Cooper, Roger Pau Monné, Anthony PERARD,
	Orzel, Michal, Julien Grall, Stefano Stabellini,
	xen-devel@lists.xenproject.org

[Public]

Hi,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Monday, March 24, 2025 10:09 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Roger Pau Monné <roger.pau@citrix.com>;
> Anthony PERARD <anthony.perard@vates.tech>; Orzel, Michal
> <Michal.Orzel@amd.com>; Julien Grall <julien@xen.org>; Stefano Stabellini
> <sstabellini@kernel.org>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v3 01/15] xen/cpufreq: introduces XEN_PM_PSD for solely
> delivery of _PSD
>
> On 06.03.2025 09:39, Penny Zheng wrote:
> > --- a/xen/include/public/platform.h
> > +++ b/xen/include/public/platform.h
> > @@ -363,12 +363,12 @@ DEFINE_XEN_GUEST_HANDLE(xenpf_getidletime_t);
> >  #define XEN_PM_PX   1
> >  #define XEN_PM_TX   2
> >  #define XEN_PM_PDC  3
> > +#define XEN_PM_PSD  4
> >
> >  /* Px sub info type */
> >  #define XEN_PX_PCT   1
> >  #define XEN_PX_PSS   2
> >  #define XEN_PX_PPC   4
> > -#define XEN_PX_PSD   8
> >
> >  struct xen_power_register {
> >      uint32_t     space_id;
> > @@ -439,6 +439,7 @@ struct xen_psd_package {
> >      uint64_t coord_type;
> >      uint64_t num_processors;
> >  };
> > +typedef struct xen_psd_package xen_psd_package_t;
> >
> >  struct xen_processor_performance {
> >      uint32_t flags;     /* flag for Px sub info type */
> > @@ -447,12 +448,6 @@ struct xen_processor_performance {
> >      struct xen_pct_register status_register;
> >      uint32_t state_count;     /* total available performance states */
> >      XEN_GUEST_HANDLE(xen_processor_px_t) states;
> > -    struct xen_psd_package domain_info;
> > -    /* Coordination type of this processor */
> > -#define XEN_CPUPERF_SHARED_TYPE_HW   1 /* HW does needed coordination */
> > -#define XEN_CPUPERF_SHARED_TYPE_ALL  2 /* All dependent CPUs should set freq */
> > -#define XEN_CPUPERF_SHARED_TYPE_ANY  3 /* Freq can be set from any dependent CPU */
> > -    uint32_t shared_type;
> >  };
> >  typedef struct xen_processor_performance xen_processor_performance_t;
> > DEFINE_XEN_GUEST_HANDLE(xen_processor_performance_t);
> > @@ -463,9 +458,15 @@ struct xenpf_set_processor_pminfo {
> >      uint32_t type;  /* {XEN_PM_CX, XEN_PM_PX} */
> >      union {
> >          struct xen_processor_power          power;/* Cx: _CST/_CSD */
> > -        struct xen_processor_performance    perf; /* Px: _PPC/_PCT/_PSS/_PSD */
> > +        xen_psd_package_t                   domain_info; /* _PSD */
> > +        struct xen_processor_performance    perf; /* Px: _PPC/_PCT/_PSS/ */
> >          XEN_GUEST_HANDLE(uint32)            pdc;  /* _PDC */
> >      } u;
> > +    /* Coordination type of this processor */
> > +#define XEN_CPUPERF_SHARED_TYPE_HW   1 /* HW does needed coordination */
> > +#define XEN_CPUPERF_SHARED_TYPE_ALL  2 /* All dependent CPUs should set freq */
> > +#define XEN_CPUPERF_SHARED_TYPE_ANY  3 /* Freq can be set from any dependent CPU */
> > +    uint32_t shared_type;
> >  };
> >  typedef struct xenpf_set_processor_pminfo xenpf_set_processor_pminfo_t;
> >  DEFINE_XEN_GUEST_HANDLE(xenpf_set_processor_pminfo_t);
>
> With this change to stable hypercall structures, how is an older Dom0 kernel going
> to be able to properly upload the necessary data? IOW: No, you can't alter existing
> stable hypercall structures like this.
>

Understood.
I'll expand the newly added "struct xen_processor_cppc" so that it also includes the _PSD info
and the shared type:
```
+struct xen_processor_cppc {
+    uint8_t flags; /* flag for CPPC sub info type */
+    /*
+     * Subset _CPC fields useful for CPPC-compatible cpufreq
+     * driver's initialization
+     */
+    struct {
+        uint32_t highest_perf;
+        uint32_t nominal_perf;
+        uint32_t lowest_nonlinear_perf;
+        uint32_t lowest_perf;
+        uint32_t lowest_mhz;
+        uint32_t nominal_mhz;
+    } cpc;
+    struct xen_psd_package domain_info; /* _PSD */
+    /* Coordination type of this processor */
+    uint32_t shared_type;
+};
+typedef struct xen_processor_cppc xen_processor_cppc_t;
```

> Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH v3 03/15] xen/cpufreq: refactor cmdline "cpufreq=xxx"
  2025-03-26 10:43       ` Jan Beulich
@ 2025-04-01  5:44         ` Penny, Zheng
  2025-04-01  6:38           ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Penny, Zheng @ 2025-04-01  5:44 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Huang, Ray, Andrew Cooper, Anthony PERARD, Orzel, Michal,
	Julien Grall, Roger Pau Monné, Stefano Stabellini,
	xen-devel@lists.xenproject.org

[Public]

Hi

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Wednesday, March 26, 2025 6:43 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Anthony PERARD <anthony.perard@vates.tech>;
> Orzel, Michal <Michal.Orzel@amd.com>; Julien Grall <julien@xen.org>; Roger
> Pau Monné <roger.pau@citrix.com>; Stefano Stabellini <sstabellini@kernel.org>;
> xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v3 03/15] xen/cpufreq: refactor cmdline "cpufreq=xxx"
>
> On 26.03.2025 08:20, Penny, Zheng wrote:
> >> -----Original Message-----
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Monday, March 24, 2025 11:01 PM
> >>
> >> On 06.03.2025 09:39, Penny Zheng wrote:
> > Maybe I mis-understood the previous comment you said ```
> >         >          else if ( IS_ENABLED(CONFIG_INTEL) && choice < 0 &&
> >         > ```
> >
> >         For the rest of this, I guess I'd prefer to see this in context. Also with
> >         regard to the helper function's name.
> > ```
> > I thought you suggested to introduce helper function to wrap the conditional
> codes...
> > Or may you were suggesting something like:
> > ```
> > #ifdef CONFIG_INTEL
> > else if ( choice < 0 && !cmdline_strcmp(str, "hwp") ) {
> >     xen_processor_pmbits |= XEN_PROCES
> >     ...
> > }
> > #endif
> > ```
>
> Was this reply of yours misplaced? It doesn't fit with the part of my reply in context
> above. Or maybe I'm not understanding what you mean to say.
>
> >> In the end I'm also not entirely convinced that we need these two
> >> almost identical helpers (with a 3rd likely appearing in a later patch).
>
> Instead it feels as if this response of yours was to this part of my comment.
> Indeed iirc I was suggesting to introduce a helper function. Note, however, the
> singular here as well as in your response above.
>

Correct me if I understood wrongly: you are suggesting that we use one single helper
function here to cover all scenarios, maybe as follows:
```
+static int __init handle_cpufreq_cmdline(const char *arg, const char *end,
+                                         enum cpufreq_xen_opt option)
+{
+    int ret;
+
+    if ( cpufreq_opts_contain(option) )
+    {
+        const char *cpufreq_opts_str[] = { "CPUFREQ_xen", "CPUFREQ_hwp" };
+
+        printk(XENLOG_WARNING
+               "Duplicate cpufreq driver option: %s",
+               cpufreq_opts_str[option - 1]);
+        return 0;
+    }
+
+    xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
+    cpufreq_controller = FREQCTL_xen;
+    cpufreq_xen_opts[cpufreq_xen_cnt++] = option;
+    switch ( option )
+    {
+    case CPUFREQ_hwp:
+        if ( arg[0] && arg[1] )
+            ret = hwp_cmdline_parse(arg + 1, end);
+    case CPUFREQ_xen:
+        if ( arg[0] && arg[1] )
+            ret = cpufreq_cmdline_parse(arg + 1, end);
+    default:
+        ret = -EINVAL;
+    }
+
+    return ret;
+}
```

> Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 03/15] xen/cpufreq: refactor cmdline "cpufreq=xxx"
  2025-04-01  5:44         ` Penny, Zheng
@ 2025-04-01  6:38           ` Jan Beulich
  2025-04-01  6:56             ` Penny, Zheng
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2025-04-01  6:38 UTC (permalink / raw)
  To: Penny, Zheng
  Cc: Huang, Ray, Andrew Cooper, Anthony PERARD, Orzel, Michal,
	Julien Grall, Roger Pau Monné, Stefano Stabellini,
	xen-devel@lists.xenproject.org

On 01.04.2025 07:44, Penny, Zheng wrote:
> [Public]
> 
> Hi
> 
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Wednesday, March 26, 2025 6:43 PM
>> To: Penny, Zheng <penny.zheng@amd.com>
>> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
>> <andrew.cooper3@citrix.com>; Anthony PERARD <anthony.perard@vates.tech>;
>> Orzel, Michal <Michal.Orzel@amd.com>; Julien Grall <julien@xen.org>; Roger
>> Pau Monné <roger.pau@citrix.com>; Stefano Stabellini <sstabellini@kernel.org>;
>> xen-devel@lists.xenproject.org
>> Subject: Re: [PATCH v3 03/15] xen/cpufreq: refactor cmdline "cpufreq=xxx"
>>
>> On 26.03.2025 08:20, Penny, Zheng wrote:
>>>> -----Original Message-----
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>> Sent: Monday, March 24, 2025 11:01 PM
>>>>
>>>> On 06.03.2025 09:39, Penny Zheng wrote:
>>> Maybe I mis-understood the previous comment you said ```
>>>         >          else if ( IS_ENABLED(CONFIG_INTEL) && choice < 0 &&
>>>         > ```
>>>
>>>         For the rest of this, I guess I'd prefer to see this in context. Also with
>>>         regard to the helper function's name.
>>> ```
>>> I thought you suggested to introduce helper function to wrap the conditional
>> codes...
>>> Or may you were suggesting something like:
>>> ```
>>> #ifdef CONFIG_INTEL
>>> else if ( choice < 0 && !cmdline_strcmp(str, "hwp") ) {
>>>     xen_processor_pmbits |= XEN_PROCES
>>>     ...
>>> }
>>> #endif
>>> ```
>>
>> Was this reply of yours misplaced? It doesn't fit with the part of my reply in context
>> above. Or maybe I'm not understanding what you mean to say.
>>
>>>> In the end I'm also not entirely convinced that we need these two
>>>> almost identical helpers (with a 3rd likely appearing in a later patch).
>>
>> Instead it feels as if this response of yours was to this part of my comment.
>> Indeed iirc I was suggesting to introduce a helper function. Note, however, the
>> singular here as well as in your response above.
>>
> 
> Correct if I understood wrongly, you are suggesting that we shall use one single helper
> function here to cover all scenarios, maybe as follows:
> ```
> +static int __init handle_cpufreq_cmdline(const char *arg, const char *end,
> +                                         enum cpufreq_xen_opt option)
> +{
> +    int ret;
> +
> +    if ( cpufreq_opts_contain(option) )
> +    {
> +        const char *cpufreq_opts_str[] = { "CPUFREQ_xen", "CPUFREQ_hwp" };
> +
> +        printk(XENLOG_WARNING
> +               "Duplicate cpufreq driver option: %s",
> +               cpufreq_opts_str[option - 1]);
> +        return 0;
> +    }
> +
> +    xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
> +    cpufreq_controller = FREQCTL_xen;
> +    cpufreq_xen_opts[cpufreq_xen_cnt++] = option;
> +    switch ( option )
> +    {
> +    case CPUFREQ_hwp:
> +        if ( arg[0] && arg[1] )
> +            ret = hwp_cmdline_parse(arg + 1, end);
> +    case CPUFREQ_xen:
> +        if ( arg[0] && arg[1] )
> +            ret = cpufreq_cmdline_parse(arg + 1, end);
> +    default:
> +        ret = -EINVAL;
> +    }

Apart from the switch() missing all break statements, the helper I was thinking
of would end right before the switch(). The <xyz>_cmdline_parse() calls would
remain at the call sites of the helper.
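
As a rough, stand-alone illustration of that split (not the actual patch): the
helper only does the duplicate check and the bookkeeping, and each call site
keeps its own parser call. record_cpufreq_option(), handle_xen_option(),
handle_hwp_option() and the two stub parsers are hypothetical names, and the
globals are simplified stand-ins.

```c
#include <stdbool.h>
#include <stdio.h>

enum cpufreq_xen_opt { CPUFREQ_xen = 1, CPUFREQ_hwp };

static enum cpufreq_xen_opt cpufreq_xen_opts[2];
static unsigned int cpufreq_xen_cnt;
static unsigned int parse_calls; /* counts stub parser invocations */

static bool cpufreq_opts_contain(enum cpufreq_xen_opt option)
{
    for ( unsigned int i = 0; i < cpufreq_xen_cnt; i++ )
        if ( cpufreq_xen_opts[i] == option )
            return true;

    return false;
}

/* The helper: ends where the switch() used to start. */
static bool record_cpufreq_option(enum cpufreq_xen_opt option)
{
    if ( cpufreq_opts_contain(option) )
    {
        fprintf(stderr, "Duplicate cpufreq driver option\n");
        return false;
    }

    cpufreq_xen_opts[cpufreq_xen_cnt++] = option;

    return true;
}

/* Stub parsers standing in for the <xyz>_cmdline_parse() functions. */
static int xen_args_parse(const char *s) { (void)s; parse_calls++; return 0; }
static int hwp_args_parse(const char *s) { (void)s; parse_calls++; return 0; }

/* Call sites: each one invokes its own parser after the common helper. */
static int handle_xen_option(const char *arg)
{
    if ( !record_cpufreq_option(CPUFREQ_xen) )
        return 0;

    return (arg[0] && arg[1]) ? xen_args_parse(arg + 1) : 0;
}

static int handle_hwp_option(const char *arg)
{
    if ( !record_cpufreq_option(CPUFREQ_hwp) )
        return 0;

    return (arg[0] && arg[1]) ? hwp_args_parse(arg + 1) : 0;
}
```

With no switch() left in the helper, the missing break statements stop being
a concern.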

Jan


^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH v3 03/15] xen/cpufreq: refactor cmdline "cpufreq=xxx"
  2025-04-01  6:38           ` Jan Beulich
@ 2025-04-01  6:56             ` Penny, Zheng
  0 siblings, 0 replies; 60+ messages in thread
From: Penny, Zheng @ 2025-04-01  6:56 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Huang, Ray, Andrew Cooper, Anthony PERARD, Orzel, Michal,
	Julien Grall, Roger Pau Monné, Stefano Stabellini,
	xen-devel@lists.xenproject.org

[Public]

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, April 1, 2025 2:38 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Anthony PERARD <anthony.perard@vates.tech>;
> Orzel, Michal <Michal.Orzel@amd.com>; Julien Grall <julien@xen.org>; Roger
> Pau Monné <roger.pau@citrix.com>; Stefano Stabellini <sstabellini@kernel.org>;
> xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v3 03/15] xen/cpufreq: refactor cmdline "cpufreq=xxx"
>
> On 01.04.2025 07:44, Penny, Zheng wrote:
> > [Public]
> >
> > Hi
> >
> >> -----Original Message-----
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Wednesday, March 26, 2025 6:43 PM
> >> To: Penny, Zheng <penny.zheng@amd.com>
> >> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
> >> <andrew.cooper3@citrix.com>; Anthony PERARD
> >> <anthony.perard@vates.tech>; Orzel, Michal <Michal.Orzel@amd.com>;
> >> Julien Grall <julien@xen.org>; Roger Pau Monné
> >> <roger.pau@citrix.com>; Stefano Stabellini <sstabellini@kernel.org>;
> >> xen-devel@lists.xenproject.org
> >> Subject: Re: [PATCH v3 03/15] xen/cpufreq: refactor cmdline "cpufreq=xxx"
> >>
> >> On 26.03.2025 08:20, Penny, Zheng wrote:
> >>>> -----Original Message-----
> >>>> From: Jan Beulich <jbeulich@suse.com>
> >>>> Sent: Monday, March 24, 2025 11:01 PM
> >>>>
> >>>> On 06.03.2025 09:39, Penny Zheng wrote:
> >>> Maybe I mis-understood the previous comment you said ```
> >>>         >          else if ( IS_ENABLED(CONFIG_INTEL) && choice < 0 &&
> >>>         > ```
> >>>
> >>>         For the rest of this, I guess I'd prefer to see this in context. Also with
> >>>         regard to the helper function's name.
> >>> ```
> >>> I thought you suggested to introduce helper function to wrap the
> >>> conditional
> >> codes...
> >>> Or may you were suggesting something like:
> >>> ```
> >>> #ifdef CONFIG_INTEL
> >>> else if ( choice < 0 && !cmdline_strcmp(str, "hwp") ) {
> >>>     xen_processor_pmbits |= XEN_PROCES
> >>>     ...
> >>> }
> >>> #endif
> >>> ```
> >>
> >> Was this reply of yours misplaced? It doesn't fit with the part of my
> >> reply in context above. Or maybe I'm not understanding what you mean to say.
> >>
> >>>> In the end I'm also not entirely convinced that we need these two
> >>>> almost identical helpers (with a 3rd likely appearing in a later patch).
> >>
> >> Instead it feels as if this response of yours was to this part of my comment.
> >> Indeed iirc I was suggesting to introduce a helper function. Note,
> >> however, the singular here as well as in your response above.
> >>
> >
> > Correct if I understood wrongly, you are suggesting that we shall use
> > one single helper function here to cover all scenarios, maybe as follows:
> > ```
> > +static int __init handle_cpufreq_cmdline(const char *arg, const char *end,
> > +                                         enum cpufreq_xen_opt option)
> > +{
> > +    int ret;
> > +
> > +    if ( cpufreq_opts_contain(option) )
> > +    {
> > > +        const char *cpufreq_opts_str[] = { "CPUFREQ_xen", "CPUFREQ_hwp" };
> > +
> > +        printk(XENLOG_WARNING
> > +               "Duplicate cpufreq driver option: %s",
> > +               cpufreq_opts_str[option - 1]);
> > +        return 0;
> > +    }
> > +
> > +    xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
> > +    cpufreq_controller = FREQCTL_xen;
> > +    cpufreq_xen_opts[cpufreq_xen_cnt++] = option;
> > +    switch ( option )
> > +    {
> > +    case CPUFREQ_hwp:
> > +        if ( arg[0] && arg[1] )
> > +            ret = hwp_cmdline_parse(arg + 1, end);
> > +    case CPUFREQ_xen:
> > +        if ( arg[0] && arg[1] )
> > +            ret = cpufreq_cmdline_parse(arg + 1, end);
> > +    default:
> > +        ret = -EINVAL;
> > +    }
>
> Apart from the switch() missing all break statements, the helper I was thinking of
> would end right before the switch(). The <xyz>_cmdline_parse() calls would
> remain at the call sites of the helper.
>

Understood!

> Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH v3 09/15] xen/x86: introduce a new amd cppc driver for cpufreq scaling
  2025-03-25  9:57   ` Jan Beulich
  2025-03-25 13:58     ` Jason Andryuk
@ 2025-04-03  7:40     ` Penny, Zheng
  2025-04-03  7:54       ` Jan Beulich
  1 sibling, 1 reply; 60+ messages in thread
From: Penny, Zheng @ 2025-04-03  7:40 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Huang, Ray, Andrew Cooper, Roger Pau Monné,
	xen-devel@lists.xenproject.org, Jason Andryuk

[Public]

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, March 25, 2025 5:58 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Roger Pau Monné <roger.pau@citrix.com>;
> xen-devel@lists.xenproject.org; Jason Andryuk <jandryuk@gmail.com>
> Subject: Re: [PATCH v3 09/15] xen/x86: introduce a new amd cppc driver for
> cpufreq scaling
>
> > +         * directly use nominal_mhz and lowest_mhz as the division
> > +         * will remove the frequency unit.
> > +         */
> > +        div = div ?: 1;
>
> Imo the cppc_data->lowest_mhz >= cppc_data->nominal_mhz case better
> wouldn't make it here, but use the fallback path below. Or special-case
> cppc_data->lowest_mhz == cppc_data->nominal_mhz: mul would
> (hopefully) be zero (i.e. there would be the expectation that
> data->caps.nominal_perf == data->caps.lowest_perf, yet no guarantee
> without checking), and hence ...

Okay, I'll drop the "div = div ?: 1" and restrict the if() to:
```
if ( cppc_data->cpc.lowest_mhz && cppc_data->cpc.nominal_mhz &&
     (cppc_data->cpc.lowest_mhz != cppc_data->cpc.nominal_mhz) )
```

>
> > +        offset = data->caps.nominal_perf -
> > +                 (mul * cppc_data->nominal_mhz) / div;
>
> ... offset = data->caps.nominal_perf regardless of "div" (as long as that's not zero).
> I.e. the "equal" case may still be fine to take this path.
>
> Or is there a check somewhere that lowest_mhz <= nominal_mhz and lowest_perf
> <= nominal_perf, which I'm simply overlooking?
>

Yes, I overlooked the scenario where lowest_mhz > nominal_mhz or lowest_perf > nominal_perf.
I'll add the check on first read.
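
For reference, a stand-alone sketch of the linear MHz-to-perf interpolation
with such a check in front. It assumes mul = nominal_perf - lowest_perf and
div = nominal_mhz - lowest_mhz, which is my reading of the quoted hunk; the
struct and the function name mhz_to_perf() are hypothetical.

```c
#include <stdint.h>

/* Hypothetical container for the four anchor values taken from _CPC. */
struct cppc_points {
    uint32_t lowest_perf, nominal_perf;
    uint32_t lowest_mhz, nominal_mhz;
};

/*
 * Map a frequency in MHz to a perf value on the line through
 * (lowest_mhz, lowest_perf) and (nominal_mhz, nominal_perf).
 * Returns -1 when the data is unusable, so the caller can take the
 * fallback path instead.
 */
static int mhz_to_perf(const struct cppc_points *p, uint32_t mhz,
                       uint32_t *perf)
{
    int64_t mul, div, offset, res;

    if ( !p->lowest_mhz || !p->nominal_mhz ||
         p->lowest_mhz >= p->nominal_mhz ||
         p->lowest_perf > p->nominal_perf )
        return -1;

    mul = (int64_t)p->nominal_perf - p->lowest_perf;
    div = (int64_t)p->nominal_mhz - p->lowest_mhz; /* > 0 after the check */
    offset = p->nominal_perf - (mul * p->nominal_mhz) / div;

    res = offset + (mul * mhz) / div;
    if ( res < 0 )
        res = 0;

    *perf = (uint32_t)res;

    return 0;
}
```

For example, with lowest = (400 MHz, 20) and nominal = (2000 MHz, 100), mul is
80 and div is 1600, the offset comes out as 0, and 1200 MHz maps to perf 60.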

> > +#define amd_get_freq(name)                                                  \
>
> The macro parameter is used just ...
>
> > +    static int amd_get_##name##_freq(const struct amd_cppc_drv_data *data,  \
>
> ... here, ...
>
> > +                                     unsigned int *freq)                    \
> > +    {                                                                       \
> > +        const struct xen_processor_cppc *cppc_data = data->cppc_data;       \
> > +        uint64_t mul, div, res;                                             \
> > +                                                                            \
> > +        if ( cppc_data->name##_mhz )                                        \
> > +        {                                                                   \
> > +            /* Switch to khz */                                             \
> > +            *freq = cppc_data->name##_mhz * 1000;                           \
>
> ... twice here for the MHz value, and ...
>
> > +            return 0;                                                       \
> > +        }                                                                   \
> > +                                                                            \
> > +        /* Read Processor Max Speed(mhz) as anchor point */                 \
> > +        mul = this_cpu(amd_max_freq_mhz);                                   \
> > +        if ( !mul )                                                         \
> > +            return -EINVAL;                                                 \
> > +        div = data->caps.highest_perf;                                      \
> > +        res = (mul * data->caps.name##_perf * 1000) / div;                  \
>
> ... here for the respective perf indicator. Why does it take ...
>
> > +        if ( res > UINT_MAX )                                               \
> > +        {                                                                   \
> > +            printk(XENLOG_ERR                                               \
> > +                   "Frequency exceeds maximum value UINT_MAX: %lu\n", res); \
> > +            return -EINVAL;                                                 \
> > +        }                                                                   \
> > +        *freq = (unsigned int)res;                                          \
> > +                                                                            \
> > +        return 0;                                                           \
> > +    }                                                                       \
> > +
> > +amd_get_freq(lowest);
> > +amd_get_freq(nominal);
>
> ... two almost identical functions, when one (with two extra input parameters) would
> suffice?
>

I have a draft fix here. If it isn't what you were hoping for, please let me know.
```
static int amd_get_lowest_and_nominal_freq(const struct amd_cppc_drv_data *data,
                                           unsigned int *lowest_freq,
                                           unsigned int *nominal_freq)
{
    const struct xen_processor_cppc *cppc_data = data->cppc_data;
    uint64_t mul, div, res;
    uint8_t perf;

    if ( !lowest_freq && !nominal_freq )
        return -EINVAL;

    if ( lowest_freq && cppc_data->cpc.lowest_mhz )
        /* Switch to khz */
        *lowest_freq = cppc_data->cpc.lowest_mhz * 1000;

    if ( nominal_freq && cppc_data->cpc.nominal_mhz )
        /* Switch to khz */
        *nominal_freq = cppc_data->cpc.nominal_mhz * 1000;

    /* Still have unresolved frequency */
    if ( (lowest_freq && !(*lowest_freq)) ||
         (nominal_freq && !(*nominal_freq)) )
    {
        do {
            /* Calculate the lowest frequency first, if needed */
            if ( lowest_freq && !(*lowest_freq) )
                perf = data->caps.lowest_perf;
            else
                perf = data->caps.nominal_perf;

            /* Read Processor Max Speed(MHz) as anchor point */
            mul = this_cpu(amd_max_pxfreq_mhz);
            if ( mul == INVAL_FREQ_MHZ || !mul )
            {
                printk(XENLOG_ERR
                       "Failed to read valid processor max frequency as anchor point: %lu\n",
                       mul);
                return -EINVAL;
            }
            div = data->caps.highest_perf;
            res = (mul * perf * 1000) / div;

            if ( res > UINT_MAX || !res )
            {
                printk(XENLOG_ERR
                       "Frequency exceeds UINT_MAX or is zero: %lu\n",
                       res);
                return -EINVAL;
            }

            if ( lowest_freq && !(*lowest_freq) )
                *lowest_freq = (unsigned int)res;
            else
                *nominal_freq = (unsigned int)res;
        } while ( nominal_freq && !(*nominal_freq) );
    }

    return 0;
}
```

> In amd_cppc_khz_to_perf() you have a check to avoid division by zero. Why not
> the same safeguarding here?
>

div is data->caps.highest_perf, and the non-zero check for highest_perf is
already done in amd_cppc_init_msrs().

> > +static int amd_get_max_freq(const struct amd_cppc_drv_data *data,
> > +                            unsigned int *max_freq)
> > +{
> > +    unsigned int nom_freq, boost_ratio;
> > +    int res;
> > +
> > +    res = amd_get_nominal_freq(data, &nom_freq);
> > +    if ( res )
> > +        return res;
> > +
> > +    boost_ratio = (unsigned int)(data->caps.highest_perf /
> > +                                 data->caps.nominal_perf);
>
> Similarly here - I can't spot what would prevent division by zero.
>

In amd_cppc_init_msrs(), before calculating any frequency, we already check
that none of the caps.xxx_perf values is zero.
I'll add a check to reject "highest_perf < nominal_perf", so that the
calculated boost_ratio can never be zero.
```
    if ( data->caps.highest_perf == 0 || data->caps.lowest_perf == 0 ||
         data->caps.nominal_perf == 0 || data->caps.lowest_nonlinear_perf == 0 ||
         data->caps.lowest_perf > data->caps.lowest_nonlinear_perf ||
         data->caps.lowest_nonlinear_perf > data->caps.nominal_perf ||
         data->caps.nominal_perf > data->caps.highest_perf )
```

>
> Jan


* Re: [PATCH v3 09/15] xen/x86: introduce a new amd cppc driver for cpufreq scaling
  2025-04-03  7:40     ` Penny, Zheng
@ 2025-04-03  7:54       ` Jan Beulich
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Beulich @ 2025-04-03  7:54 UTC (permalink / raw)
  To: Penny, Zheng
  Cc: Huang, Ray, Andrew Cooper, Roger Pau Monné,
	xen-devel@lists.xenproject.org, Jason Andryuk

On 03.04.2025 09:40, Penny, Zheng wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Tuesday, March 25, 2025 5:58 PM
>>
>>> +#define amd_get_freq(name)                                                  \
>>
>> The macro parameter is used just ...
>>
>>> +    static int amd_get_##name##_freq(const struct amd_cppc_drv_data *data,  \
>>
>> ... here, ...
>>
>>> +                                     unsigned int *freq)                    \
>>> +    {                                                                       \
>>> +        const struct xen_processor_cppc *cppc_data = data->cppc_data;       \
>>> +        uint64_t mul, div, res;                                             \
>>> +                                                                            \
>>> +        if ( cppc_data->name##_mhz )                                        \
>>> +        {                                                                   \
>>> +            /* Switch to khz */                                             \
>>> +            *freq = cppc_data->name##_mhz * 1000;                           \
>>
>> ... twice here for the MHz value, and ...
>>
>>> +            return 0;                                                       \
>>> +        }                                                                   \
>>> +                                                                            \
>>> +        /* Read Processor Max Speed(mhz) as anchor point */                 \
>>> +        mul = this_cpu(amd_max_freq_mhz);                                   \
>>> +        if ( !mul )                                                         \
>>> +            return -EINVAL;                                                 \
>>> +        div = data->caps.highest_perf;                                      \
>>> +        res = (mul * data->caps.name##_perf * 1000) / div;                  \
>>
>> ... here for the respective perf indicator. Why does it take ...
>>
>>> +        if ( res > UINT_MAX )                                               \
>>> +        {                                                                   \
>>> +            printk(XENLOG_ERR                                               \
>>> +                   "Frequency exceeds maximum value UINT_MAX: %lu\n", res); \
>>> +            return -EINVAL;                                                 \
>>> +        }                                                                   \
>>> +        *freq = (unsigned int)res;                                          \
>>> +                                                                            \
>>> +        return 0;                                                           \
>>> +    }                                                                       \
>>> +
>>> +amd_get_freq(lowest);
>>> +amd_get_freq(nominal);
>>
>> ... two almost identical functions, when one (with two extra input parameters) would
>> suffice?
>>
> 
> I have a draft fix here. If it isn't what you were hoping for, please let me know.
> ```
> static int amd_get_lowest_and_nominal_freq(const struct amd_cppc_drv_data *data,
>                                            unsigned int *lowest_freq,
>                                            unsigned int *nominal_freq)

Why two outputs now when there was just one in the macro-ized form? I was
rather expecting new inputs to appear, to account for the prior uses of
the macro parameter. (As a result the function is now also quite a bit
more complex than it was before. In particular there was no ...

> {
>     const struct xen_processor_cppc *cppc_data = data->cppc_data;
>     uint64_t mul, div, res;
>     uint8_t perf;
> 
>     if ( !lowest_freq && !nominal_freq )
>         return -EINVAL;
> 
>     if ( lowest_freq && cppc_data->cpc.lowest_mhz )
>         /* Switch to khz */
>         *lowest_freq = cppc_data->cpc.lowest_mhz * 1000;
> 
>     if ( nominal_freq && cppc_data->cpc.nominal_mhz )
>         /* Switch to khz */
>         *nominal_freq = cppc_data->cpc.nominal_mhz * 1000;
> 
>     /* Still have unresolved frequency */
>     if ( (lowest_freq && !(*lowest_freq)) ||
>          (nominal_freq && !(*nominal_freq)) )
>     {
>         do {
>             /* Calculate the lowest frequency first, if needed */
>             if ( lowest_freq && !(*lowest_freq) )
>                 perf = data->caps.lowest_perf;
>             else
>                 perf = data->caps.nominal_perf;
> 
>             /* Read Processor Max Speed(MHz) as anchor point */
>             mul = this_cpu(amd_max_pxfreq_mhz);
>             if ( mul == INVAL_FREQ_MHZ || !mul )
>             {
>                 printk(XENLOG_ERR
>                        "Failed to read valid processor max frequency as anchor point: %lu\n",
>                        mul);
>                 return -EINVAL;
>             }
>             div = data->caps.highest_perf;
>             res = (mul * perf * 1000) / div;
> 
>             if ( res > UINT_MAX || !res )
>             {
>                 printk(XENLOG_ERR
>                        "Frequency exceeds UINT_MAX or is zero: %lu\n",
>                        res);
>                 return -EINVAL;
>             }
> 
>             if ( lowest_freq && !(*lowest_freq) )
>                 *lowest_freq = (unsigned int)res;
>             else
>                 *nominal_freq = (unsigned int)res;
>         } while ( nominal_freq && !(*nominal_freq) );

... loop there.)

Jan

>     }
> 
>     return 0;
> }
> ```




* RE: [PATCH v3 12/15] xen/x86: implement EPP support for the amd-cppc driver in active mode
  2025-03-28  7:18       ` Jan Beulich
@ 2025-04-08 10:32         ` Penny, Zheng
  0 siblings, 0 replies; 60+ messages in thread
From: Penny, Zheng @ 2025-04-08 10:32 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Huang, Ray, Andrew Cooper, Anthony PERARD, Orzel, Michal,
	Julien Grall, Roger Pau Monné, Stefano Stabellini,
	xen-devel@lists.xenproject.org

[Public]

Hi,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Friday, March 28, 2025 3:18 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Anthony PERARD <anthony.perard@vates.tech>;
> Orzel, Michal <Michal.Orzel@amd.com>; Julien Grall <julien@xen.org>; Roger
> Pau Monné <roger.pau@citrix.com>; Stefano Stabellini <sstabellini@kernel.org>;
> xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v3 12/15] xen/x86: implement EPP support for the amd-cppc
> driver in active mode
>
> On 28.03.2025 05:07, Penny, Zheng wrote:
> >> -----Original Message-----
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Tuesday, March 25, 2025 6:49 PM
> >>
> >> On 06.03.2025 09:39, Penny Zheng wrote:
>
> >>> +    {
> >>> +        /* Force the epp value to be zero for performance policy */
> >>> +        epp = CPPC_ENERGY_PERF_MAX_PERFORMANCE;
> >>> +        min_perf = max_perf;
> >>> +    }
> >>> +    else if ( policy->policy == CPUFREQ_POLICY_POWERSAVE )
> >>> +        /* Force the epp value to be 0xff for powersave policy */
> >>> +        /*
> >>> +         * If set max_perf = min_perf = lowest_perf, we are putting
> >>> +         * cpu cores in idle.
> >>> +         */
> >>
> >> Nit: Such two successive comments want combining. (Same near the top
> >> of the function, as I notice only now.)
> >>
> >> Furthermore I'm in trouble with interpreting this comment: To me "lowest"
> >> doesn't mean "doing nothing" but "doing things as efficiently in
> >> terms of power use as possible". IOW that's not idle. Yet the comment
> >> reads as if it was meant to be an explanation of why we can't set
> >> max_perf from min_perf here. That is, no matter what's meant to be
> >> said, I think this needs re-wording (and possibly using subjunctive mood).
> >
> > How about:
> > The lowest non-linear perf is equivalent to the P2 frequency. Reducing
> > performance below this point does not lead to total energy savings for a
> > given computation (although it reduces momentary power).
> > So we are not suggesting to set max_perf smaller than lowest non-linear
> > perf, or even the lowest perf.
>
> In an abstract way I think I can follow this. In the context of the code being
> commented, however, I'm afraid I still can't make sense of it. Main point being that
> the code commented doesn't use any of the *_perf values. It only sets the "epp"
> local variable. Maybe the point of the comment is to explain why none of the *_perf
> values are used here, but I can't read this out of either of the proposed texts.
>

I've checked some internal test suites for CPPC on Windows. Setting max_perf = nominal_perf
may be a fair option for powersave mode.

> Jan



Thread overview: 60+ messages
2025-03-06  8:39 [PATCH v3 00/15] amd-cppc CPU Performance Scaling Driver Penny Zheng
2025-03-06  8:39 ` [PATCH v3 01/15] xen/cpufreq: introduces XEN_PM_PSD for solely delivery of _PSD Penny Zheng
2025-03-24 14:08   ` Jan Beulich
2025-04-01  3:25     ` Penny, Zheng
2025-03-06  8:39 ` [PATCH v3 02/15] xen/x86: introduce new sub-hypercall to propagate CPPC data Penny Zheng
2025-03-24 14:28   ` Jan Beulich
2025-03-25  4:12     ` Penny, Zheng
2025-03-25  7:53       ` Jan Beulich
2025-03-28  8:27         ` Penny, Zheng
2025-03-28  8:36           ` Jan Beulich
2025-03-06  8:39 ` [PATCH v3 03/15] xen/cpufreq: refactor cmdline "cpufreq=xxx" Penny Zheng
2025-03-24 15:00   ` Jan Beulich
2025-03-26  7:20     ` Penny, Zheng
2025-03-26 10:43       ` Jan Beulich
2025-04-01  5:44         ` Penny, Zheng
2025-04-01  6:38           ` Jan Beulich
2025-04-01  6:56             ` Penny, Zheng
2025-03-06  8:39 ` [PATCH v3 04/15] xen/cpufreq: move XEN_PROCESSOR_PM_xxx to internal header Penny Zheng
2025-03-24 15:11   ` Jan Beulich
2025-03-26  7:48     ` Penny, Zheng
2025-03-06  8:39 ` [PATCH v3 05/15] xen/x86: introduce "cpufreq=amd-cppc" xen cmdline Penny Zheng
2025-03-24 15:26   ` Jan Beulich
2025-03-26  8:35     ` Penny, Zheng
2025-03-26 10:55       ` Jan Beulich
2025-03-27  3:12         ` Penny, Zheng
2025-03-27  7:48           ` Jan Beulich
2025-03-28  4:43             ` Penny, Zheng
2025-03-25 10:00   ` Jan Beulich
2025-03-06  8:39 ` [PATCH v3 06/15] xen/cpufreq: disable px statistic info in amd-cppc mode Penny Zheng
2025-03-24 15:34   ` Jan Beulich
2025-03-06  8:39 ` [PATCH v3 07/15] xen/cpufreq: fix core frequency calculation for AMD Family 1Ah CPUs Penny Zheng
2025-03-24 15:47   ` Jan Beulich
2025-03-25 10:55     ` Nicola Vetrini
2025-03-26  9:54     ` Penny, Zheng
2025-03-26 10:14       ` Nicola Vetrini
2025-03-26 10:19         ` Jan Beulich
2025-03-06  8:39 ` [PATCH v3 08/15] xen/amd: export processor max frequency value Penny Zheng
2025-03-24 15:52   ` Jan Beulich
2025-03-27  8:38     ` Penny, Zheng
2025-03-06  8:39 ` [PATCH v3 09/15] xen/x86: introduce a new amd cppc driver for cpufreq scaling Penny Zheng
2025-03-25  9:57   ` Jan Beulich
2025-03-25 13:58     ` Jason Andryuk
2025-04-03  7:40     ` Penny, Zheng
2025-04-03  7:54       ` Jan Beulich
2025-03-06  8:39 ` [PATCH v3 10/15] xen/cpufreq: only set gov NULL when cpufreq_driver.setpolicy is NULL Penny Zheng
2025-03-24 16:32   ` Jan Beulich
2025-03-06  8:39 ` [PATCH v3 11/15] xen/cpufreq: abstract Energy Performance Preference value Penny Zheng
2025-03-25 10:13   ` Jan Beulich
2025-03-06  8:39 ` [PATCH v3 12/15] xen/x86: implement EPP support for the amd-cppc driver in active mode Penny Zheng
2025-03-25 10:48   ` Jan Beulich
2025-03-28  4:07     ` Penny, Zheng
2025-03-28  7:18       ` Jan Beulich
2025-04-08 10:32         ` Penny, Zheng
2025-03-06  8:39 ` [PATCH v3 13/15] tools/xenpm: Print CPPC parameters for amd-cppc driver Penny Zheng
2025-03-06  8:39 ` [PATCH v3 14/15] xen/xenpm: Adapt cpu frequency monitor in xenpm Penny Zheng
2025-03-25 11:26   ` Jan Beulich
2025-03-25 16:37     ` Jason Andryuk
2025-03-26 15:45     ` Anthony PERARD
2025-03-06  8:39 ` [PATCH v3 15/15] xen/cpufreq: Adapt SET/GET_CPUFREQ_CPPC xen_sysctl_pm_op for amd-cppc driver Penny Zheng
2025-03-25 16:59   ` Jan Beulich
