* [PATCH v6 0/2] Improve VM CPUfreq and task placement behavior
From: David Dai @ 2024-05-21 4:30 UTC
To: Rafael J. Wysocki, Viresh Kumar, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Sudeep Holla, David Dai, Saravana Kannan
Cc: Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
Vincent Guittot, Marc Zyngier, Oliver Upton, Dietmar Eggemann,
Pavan Kondeti, Gupta Pankaj, Mel Gorman, kernel-team, linux-pm,
devicetree, linux-kernel
Hi,
This patch series is a continuation of the talk Saravana gave at LPC 2022
titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
of the talk is that workloads running in a guest VM get terrible task
placement and CPUfreq behavior when compared to running the same workload
in the host. Effectively, no EAS (Energy Aware Scheduling) for threads
inside VMs. This would make power and performance terrible just by running
the workload in a VM even if we assume there is zero virtualization
overhead.
With this series, a workload running in a VM gets the same task placement
and CPUfreq behavior as it would when running in the host.
The idea is to improve VM CPUfreq/sched behavior by:
- Having guest kernel do accurate load tracking by taking host CPU
arch/type and frequency into account.
- Sharing vCPU frequency requirements with the host so that the
host can do proper frequency scaling and task placement on the host side.
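As a rough sketch of the first point (an illustration only, not code taken
from this series): the guest can turn the host-reported current frequency
into a frequency-invariance scale factor, mirroring the arithmetic in
virt_scale_freq_tick() from patch 2/2. The helper name vm_freq_scale() and
the CAP_* constants are made up for this example; in the kernel the
corresponding constants are SCHED_CAPACITY_SHIFT and SCHED_CAPACITY_SCALE.

```c
#include <assert.h>
#include <stdint.h>

#define CAP_SHIFT 10                 /* stand-in for SCHED_CAPACITY_SHIFT */
#define CAP_SCALE (1UL << CAP_SHIFT) /* stand-in for SCHED_CAPACITY_SCALE */

/*
 * Hypothetical helper: returns a scale factor in [0, 1024], where 1024
 * means the vCPU currently runs at its maximum frequency. Readings above
 * the maximum are clamped to 1024.
 */
static unsigned long vm_freq_scale(uint32_t cur_khz, uint32_t max_khz)
{
	uint64_t scaled;
	unsigned long scale;

	assert(max_khz != 0);	/* a zero maximum would divide by zero */

	scaled = (uint64_t)cur_khz << CAP_SHIFT;
	scale = (unsigned long)(scaled / max_khz);

	return scale < CAP_SCALE ? scale : CAP_SCALE;
}
```

For instance, a vCPU reported at 1000000 kHz with a 2000000 kHz maximum
gives a scale of 512, so time spent running in that window is weighted at
half of what it would be at full speed.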
Based on feedback from the RFC v1 proposal[4], we've revised our
implementation to use MMIO reads and writes to pass information
from/to the host instead of using hypercalls. In our example, the
VMM (Virtual Machine Manager) translates the frequency requests into
uclamp_min and applies it to the vCPU thread as a hint to the host
kernel.
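One plausible way for a VMM to do this translation, shown purely as a
sketch: scale the capacity of the host CPU that the vCPU is pinned to by
the ratio of requested to maximum frequency, and use the result as
uclamp_min. The function name and the linear mapping are assumptions made
for illustration; the CrosVM patches[5][6] are the authority on what is
actually done.

```c
#include <assert.h>
#include <stdint.h>

#define UCLAMP_MAX_VALUE 1024u	/* uclamp uses the 0..1024 capacity scale */

/*
 * Hypothetical helper: map a guest frequency request to a uclamp_min hint.
 * cpu_capacity is the capacity (0..1024) of the host CPU backing the vCPU.
 */
static uint32_t freq_to_uclamp_min(uint32_t req_khz, uint32_t max_khz,
				   uint32_t cpu_capacity)
{
	uint64_t util;

	assert(max_khz != 0);
	assert(cpu_capacity <= UCLAMP_MAX_VALUE);

	/* Requests above the advertised maximum are treated as the maximum. */
	if (req_khz > max_khz)
		req_khz = max_khz;

	util = (uint64_t)cpu_capacity * req_khz / max_khz;
	return (uint32_t)util;
}
```

The resulting value would then be applied to the vCPU thread, for example
through sched_setattr() with SCHED_FLAG_UTIL_CLAMP_MIN set.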
To achieve the results below, configure the host to:
- Affine vCPUs to specific clusters.
- Set vCPU capacity to match the host CPU they are running on.
To make it easy for folks to try this out with CrosVM, we have put up
userspace patches[5][6]. With those patches, you can configure CrosVM
correctly by adding the options "--host-cpu-topology" and "--virt-cpufreq".
Results:
========
Here are some side-by-side comparisons of the RFC v1 proposal vs the
current patch series, labelled as follows:
- (RFC v1) UtilHyp = hypercall + util_guest
- (current) UClampMMIO = MMIO + UClamp_min
Use cases running a minimal system inside a VM on a Pixel 6:
============================================================
FIO
Higher is better
+-------------------+----------+---------+--------+------------+--------+
| Usecase(avg MB/s) | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| Seq Write | 13.3 | 16.4 | +23% | 13.6 | +2% |
+-------------------+----------+---------+--------+------------+--------+
| Rand Write | 11.2 | 12.9 | +15% | 11.8 | +8% |
+-------------------+----------+---------+--------+------------+--------+
| Seq Read | 100 | 168 | +68% | 138 | +38% |
+-------------------+----------+---------+--------+------------+--------+
| Rand Read | 20.5 | 35.6 | +74% | 31.0 | +51% |
+-------------------+----------+---------+--------+------------+--------+
CPU-based ML Inference Benchmark
Lower is better
+----------------+----------+------------+--------+------------+--------+
| Test Case (ms) | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+----------------+----------+------------+--------+------------+--------+
| Cached Sample | | | | | |
| Inference | 3.40 | 2.37 | -30% | 2.99 | -12% |
+----------------+----------+------------+--------+------------+--------+
| Small Sample | | | | | |
| Inference | 9.87 | 6.78 | -31% | 7.65 | -22% |
+----------------+----------+------------+--------+------------+--------+
| Large Sample | | | | | |
| Inference | 33.35 | 26.74 | -20% | 31.05 | -7% |
+----------------+----------+------------+--------+------------+--------+
Use cases running Android inside a VM on a Chromebook:
======================================================
PCMark (Emulates real world usecases)
Higher is better
+-------------------+----------+---------+--------+------------+--------+
| Test Case (score) | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| Weighted Total | 6190 | 7442 | +20% | 7171 | +16% |
+-------------------+----------+---------+--------+------------+--------+
| Web Browsing | 5461 | 6620 | +21% | 6284 | +15% |
+-------------------+----------+---------+--------+------------+--------+
| Video Editing | 4891 | 5376 | +10% | 5344 | +9% |
+-------------------+----------+---------+--------+------------+--------+
| Writing | 6929 | 8791 | +27% | 8457 | +22% |
+-------------------+----------+---------+--------+------------+--------+
| Photo Editing | 7966 | 12057 | +51% | 11881 | +49% |
+-------------------+----------+---------+--------+------------+--------+
| Data Manipulation | 5596 | 6057 | +8% | 5694 | +2% |
+-------------------+----------+---------+--------+------------+--------+
PCMark Performance/mAh
Higher is better
+-------------------+----------+---------+--------+------------+--------+
| | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| Score/mAh | 87 | 100 | +15% | 92 | +5% |
+-------------------+----------+---------+--------+------------+--------+
Roblox
Higher is better
+-------------------+----------+---------+--------+------------+--------+
| | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| FPS | 17.92 | 21.82 | +22% | 20.02 | +12% |
+-------------------+----------+---------+--------+------------+--------+
Roblox Frames/mAh
Higher is better
+-------------------+----------+---------+--------+------------+--------+
| | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| Frames/mAh | 77.91 | 84.46 | +8% | 81.71 | +5% |
+-------------------+----------+---------+--------+------------+--------+
We've simplified our implementation based on community feedback to make
it less intrusive and to use a more generic MMIO interface for
communication with the host. The results show that the current design
still has tangible improvements over baseline. We'll continue looking
into ways to reduce the overhead of the MMIO read/writes and submit
separate and generic patches for that if we find any good optimizations.
Thanks,
David & Saravana
Cc: Saravana Kannan <saravanak@google.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Masami Hiramatsu <mhiramat@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Pavan Kondeti <quic_pkondeti@quicinc.com>
Cc: Gupta Pankaj <pankaj.gupta@amd.com>
Cc: Mel Gorman <mgorman@suse.de>
v5 -> v6:
-Renamed dt-binding documentation file to match compatible string
-Removed opp bindings from dt-binding examples
-Added register interface description as a comment in driver
-Performance info now initialized from the device via MMIO instead
of opp bindings
-Updated driver to use perf tables or max perf depending on VMM
-Added initialization for sharing perf domain
-Updated driver to use .target instead of .target_index
-Updated .verify to handle both perf tables and max perf case
-Updated Kconfig dependencies
v4 -> v5:
-Added dt-binding description to allow for normalized frequencies
-Updated dt-binding examples with normalized frequency values
-Updated cpufreq exit to use dev_pm_opp_free_cpufreq_table to free tables
-Updated fast_switch and target_index to use entries from cpufreq tables
-Refreshed benchmark numbers using indexed frequencies
-Added missing header that was indirectly being used
v3 -> v4:
-Fixed dt-binding formatting issues
-Added additional dt-binding descriptions for “HW interfaces”
-Changed dt-binding to “qemu,virtual-cpufreq”
-Fixed Kconfig formatting issues
-Removed frequency downscaling when requesting frequency updates
-Removed ops and cpufreq driver data
-Added check to limit freq_scale to 1024
-Added KHZ in the register offset naming
-Added comments to explain FIE and not allowing dvfs_possible_from_any_cpu
v2 -> v3:
- Dropped patches adding new hypercalls
- Dropped patch adding util_guest in sched/fair
- Cpufreq driver now populates frequency using opp bindings
- Removed transition_delay_us=1 cpufreq setting as it was configured too
aggressively and resulted in poor I/O performance
- Modified guest cpufreq driver to read/write MMIO regions instead of
using hypercalls to communicate with the host
- Modified guest cpufreq driver to pass frequency info instead of
utilization of the current vCPU's runqueue which now takes
iowait_boost into account from the schedutil governor
- Updated DT bindings for a virtual CPU frequency device
Userspace changes:
- Updated CrosVM patches to emulate a virtual cpufreq device
- Updated to newer userspace binaries when collecting more recent
benchmark data
v1 -> v2:
- No functional changes.
- Added description for EAS and removed DVFS in cover letter.
- Added a v2 tag to the subject.
- Fixed up the inconsistent "units" between tables.
- Made sure everyone is To/Cc-ed for all the patches in the series.
[1] - https://lpc.events/event/16/contributions/1195/
[2] - https://lpc.events/event/16/contributions/1195/attachments/970/1893/LPC%202022%20-%20VM%20DVFS.pdf
[3] - https://www.youtube.com/watch?v=hIg_5bg6opU
[4] - https://lore.kernel.org/all/20230331014356.1033759-1-davidai@google.com/
[5] - https://chromium-review.googlesource.com/c/crosvm/crosvm/+/4208668
[6] - https://chromium-review.googlesource.com/q/topic:%22virtcpufreq-v6%22
David Dai (2):
dt-bindings: cpufreq: add virtual cpufreq device
cpufreq: add virtual-cpufreq driver
.../cpufreq/qemu,virtual-cpufreq.yaml | 48 +++
drivers/cpufreq/Kconfig | 14 +
drivers/cpufreq/Makefile | 1 +
drivers/cpufreq/virtual-cpufreq.c | 335 ++++++++++++++++++
include/linux/arch_topology.h | 1 +
5 files changed, 399 insertions(+)
create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,virtual-cpufreq.yaml
create mode 100644 drivers/cpufreq/virtual-cpufreq.c
--
2.45.0.215.g3402c0e53f-goog
* [PATCH v6 1/2] dt-bindings: cpufreq: add virtual cpufreq device
From: David Dai @ 2024-05-21 4:30 UTC
To: Rafael J. Wysocki, Viresh Kumar, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Sudeep Holla, David Dai, Saravana Kannan
Cc: Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
Vincent Guittot, Marc Zyngier, Oliver Upton, Dietmar Eggemann,
Pavan Kondeti, Gupta Pankaj, Mel Gorman, kernel-team, linux-pm,
devicetree, linux-kernel

Adding bindings to represent a virtual cpufreq device.

Virtual machines may expose MMIO regions for a virtual cpufreq device
for guests to read performance information or to request performance
selection. The virtual cpufreq device has an individual controller for
each performance domain. Performance points for a given domain can be
normalized across all domains for ease of allowing for virtual machines
to migrate between hosts.

Co-developed-by: Saravana Kannan <saravanak@google.com>
Signed-off-by: Saravana Kannan <saravanak@google.com>
Signed-off-by: David Dai <davidai@google.com>
---
 .../cpufreq/qemu,virtual-cpufreq.yaml         | 48 +++++++++++++++++++
 1 file changed, 48 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,virtual-cpufreq.yaml

diff --git a/Documentation/devicetree/bindings/cpufreq/qemu,virtual-cpufreq.yaml b/Documentation/devicetree/bindings/cpufreq/qemu,virtual-cpufreq.yaml
new file mode 100644
index 000000000000..018d98bcdc82
--- /dev/null
+++ b/Documentation/devicetree/bindings/cpufreq/qemu,virtual-cpufreq.yaml
@@ -0,0 +1,48 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/cpufreq/qemu,virtual-cpufreq.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Virtual CPUFreq
+
+maintainers:
+  - David Dai <davidai@google.com>
+  - Saravana Kannan <saravanak@google.com>
+
+description:
+  Virtual CPUFreq is a virtualized driver in guest kernels that sends performance
+  selection of its vCPUs as a hint to the host through MMIO regions. Each vCPU
+  is associated with a performance domain which can be shared with other vCPUs.
+  Each performance domain has its own set of registers for performance controls.
+
+properties:
+  compatible:
+    const: qemu,virtual-cpufreq
+
+  reg:
+    maxItems: 1
+    description:
+      Address and size of region containing performance controls for each of the
+      performance domains. Regions for each performance domain is placed
+      contiguously and contain registers for controlling DVFS(Dynamic Frequency
+      and Voltage) characteristics. The size of the region is proportional to
+      total number of performance domains.
+
+required:
+  - compatible
+  - reg
+
+additionalProperties: false
+
+examples:
+  - |
+    soc {
+      #address-cells = <1>;
+      #size-cells = <1>;
+
+      cpufreq@1040000 {
+        compatible = "qemu,virtual-cpufreq";
+        reg = <0x1040000 0x2000>;
+      };
+    };
-- 
2.45.0.215.g3402c0e53f-goog
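To make the reg example concrete, here is a small sketch of how a register
address inside that region could be computed. It assumes the 0x1000 stride
per register window and the set_perf offset used by the driver in patch
2/2; the helper names are hypothetical, and the binding itself does not fix
the stride.

```c
#include <assert.h>
#include <stdint.h>

#define WINDOW_STRIDE             0x1000u /* PER_CPU_OFFSET in the driver */
#define REG_SET_PERF_STATE_OFFSET 0x4u

/* Size of a region holding nr_windows contiguous register windows. */
static uint64_t vcpufreq_region_size(unsigned int nr_windows)
{
	return (uint64_t)nr_windows * WINDOW_STRIDE;
}

/* Address of register `reg` inside window number `window`. */
static uint64_t vcpufreq_reg_addr(uint64_t base, unsigned int window,
				  uint32_t reg)
{
	assert(reg < WINDOW_STRIDE);	/* register must lie inside one window */
	return base + (uint64_t)window * WINDOW_STRIDE + reg;
}
```

Under these assumptions the 0x2000-byte region in the example holds two
windows, and the set_perf register of the second window sits at 0x1041004.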
* Re: [PATCH v6 1/2] dt-bindings: cpufreq: add virtual cpufreq device
From: Rob Herring (Arm) @ 2024-05-22 15:00 UTC
To: David Dai
Cc: Will Deacon, Sudeep Holla, Vincent Guittot, Pavan Kondeti,
devicetree, Dietmar Eggemann, Viresh Kumar, Krzysztof Kozlowski,
Saravana Kannan, Peter Zijlstra, Conor Dooley, Oliver Upton, linux-pm,
linux-kernel, Masami Hiramatsu, Quentin Perret, Gupta Pankaj,
Marc Zyngier, Mel Gorman, Rafael J. Wysocki, kernel-team

On Mon, 20 May 2024 21:30:51 -0700, David Dai wrote:
> Adding bindings to represent a virtual cpufreq device.
>
> Virtual machines may expose MMIO regions for a virtual cpufreq device
> for guests to read performance information or to request performance
> selection. The virtual cpufreq device has an individual controller for
> each performance domain. Performance points for a given domain can be
> normalized across all domains for ease of allowing for virtual machines
> to migrate between hosts.
>
> Co-developed-by: Saravana Kannan <saravanak@google.com>
> Signed-off-by: Saravana Kannan <saravanak@google.com>
> Signed-off-by: David Dai <davidai@google.com>
> ---
>  .../cpufreq/qemu,virtual-cpufreq.yaml         | 48 +++++++++++++++++++
>  1 file changed, 48 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,virtual-cpufreq.yaml
>

Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
* [PATCH v6 2/2] cpufreq: add virtual-cpufreq driver
From: David Dai @ 2024-05-21 4:30 UTC
To: Rafael J. Wysocki, Viresh Kumar, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Sudeep Holla, David Dai, Saravana Kannan
Cc: Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
Vincent Guittot, Marc Zyngier, Oliver Upton, Dietmar Eggemann,
Pavan Kondeti, Gupta Pankaj, Mel Gorman, kernel-team, linux-pm,
devicetree, linux-kernel

Introduce a virtualized cpufreq driver for guest kernels to improve
performance and power of workloads within VMs.

This driver does two main things:

1. Sends the frequency of vCPUs as a hint to the host. The host uses the
hint to schedule the vCPU threads and decide physical CPU frequency.

2. If a VM does not support a virtualized FIE(like AMUs), it queries the
host CPU frequency by reading a MMIO region of a virtual cpufreq device
to update the guest's frequency scaling factor periodically. This enables
accurate Per-Entity Load Tracking for tasks running in the guest.

Co-developed-by: Saravana Kannan <saravanak@google.com>
Signed-off-by: Saravana Kannan <saravanak@google.com>
Signed-off-by: David Dai <davidai@google.com>
---
 drivers/cpufreq/Kconfig           |  14 ++
 drivers/cpufreq/Makefile          |   1 +
 drivers/cpufreq/virtual-cpufreq.c | 335 ++++++++++++++++++++++++++++++
 include/linux/arch_topology.h     |   1 +
 4 files changed, 351 insertions(+)
 create mode 100644 drivers/cpufreq/virtual-cpufreq.c

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 94e55c40970a..9aa86bec5793 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -217,6 +217,20 @@ config CPUFREQ_DT
 
 	  If in doubt, say N.
 
+config CPUFREQ_VIRT
+	tristate "Virtual cpufreq driver"
+	depends on OF && GENERIC_ARCH_TOPOLOGY
+	help
+	  This adds a virtualized cpufreq driver for guest kernels that
+	  read/writes to a MMIO region for a virtualized cpufreq device to
+	  communicate with the host. It sends performance requests to the host
+	  which gets used as a hint to schedule vCPU threads and select CPU
+	  frequency. If a VM does not support a virtualized FIE such as AMUs,
+	  it updates the frequency scaling factor by polling host CPU frequency
+	  to enable accurate Per-Entity Load Tracking for tasks running in the guest.
+
+	  If in doubt, say N.
+
 config CPUFREQ_DT_PLATDEV
 	tristate "Generic DT based cpufreq platdev driver"
 	depends on OF
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index 8d141c71b016..eb72ecdc24db 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -16,6 +16,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_ATTR_SET)	+= cpufreq_governor_attr_set.o
 
 obj-$(CONFIG_CPUFREQ_DT)		+= cpufreq-dt.o
 obj-$(CONFIG_CPUFREQ_DT_PLATDEV)	+= cpufreq-dt-platdev.o
+obj-$(CONFIG_CPUFREQ_VIRT)		+= virtual-cpufreq.o
 
 # Traces
 CFLAGS_amd-pstate-trace.o               := -I$(src)
diff --git a/drivers/cpufreq/virtual-cpufreq.c b/drivers/cpufreq/virtual-cpufreq.c
new file mode 100644
index 000000000000..59ce2bda3913
--- /dev/null
+++ b/drivers/cpufreq/virtual-cpufreq.c
@@ -0,0 +1,335 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2024 Google LLC
+ */
+
+#include <linux/arch_topology.h>
+#include <linux/cpufreq.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/of_address.h>
+#include <linux/of_platform.h>
+#include <linux/platform_device.h>
+#include <linux/slab.h>
+
+/*
+ * CPU0..CPUn
+ * +-------------+-------------------------------+--------+-------+
+ * | Register    | Description                   | Offset | Len   |
+ * +-------------+-------------------------------+--------+-------+
+ * | cur_perf    | read this register to get     | 0x0    | 0x4   |
+ * |             | the current perf (integer val |        |       |
+ * |             | representing perf relative to |        |       |
+ * |             | max performance)              |        |       |
+ * |             | that vCPU is running at       |        |       |
+ * +-------------+-------------------------------+--------+-------+
+ * | set_perf    | write to this register to set | 0x4    | 0x4   |
+ * |             | perf value of the vCPU        |        |       |
+ * +-------------+-------------------------------+--------+-------+
+ * | perftbl_len | number of entries in perf     | 0x8    | 0x4   |
+ * |             | table. A single entry in the  |        |       |
+ * |             | perf table denotes no table   |        |       |
+ * |             | and the entry contains        |        |       |
+ * |             | the maximum perf value        |        |       |
+ * |             | that this vCPU supports.      |        |       |
+ * |             | The guest can request any     |        |       |
+ * |             | value between 1 and max perf  |        |       |
+ * |             | when perftbls are not used.   |        |       |
+ * +---------------------------------------------+--------+-------+
+ * | perftbl_sel | write to this register to     | 0xc    | 0x4   |
+ * |             | select perf table entry to    |        |       |
+ * |             | read from                     |        |       |
+ * +---------------------------------------------+--------+-------+
+ * | perftbl_rd  | read this register to get     | 0x10   | 0x4   |
+ * |             | perf value of the selected    |        |       |
+ * |             | entry based on perftbl_sel    |        |       |
+ * +---------------------------------------------+--------+-------+
+ * | perf_domain | performance domain number     | 0x14   | 0x4   |
+ * |             | that this vCPU belongs to.    |        |       |
+ * |             | vCPUs sharing the same perf   |        |       |
+ * |             | domain number are part of the |        |       |
+ * |             | same performance domain.      |        |       |
+ * +-------------+-------------------------------+--------+-------+
+ */
+
+#define REG_CUR_PERF_STATE_OFFSET 0x0
+#define REG_SET_PERF_STATE_OFFSET 0x4
+#define REG_PERFTBL_LEN_OFFSET 0x8
+#define REG_PERFTBL_SEL_OFFSET 0xc
+#define REG_PERFTBL_RD_OFFSET 0x10
+#define REG_PERF_DOMAIN_OFFSET 0x14
+#define PER_CPU_OFFSET 0x1000
+
+#define PERFTBL_MAX_ENTRIES 64U
+
+static void __iomem *base;
+static DEFINE_PER_CPU(u32, perftbl_num_entries);
+
+static void virt_scale_freq_tick(void)
+{
+	int cpu = smp_processor_id();
+	u32 max_freq = (u32)cpufreq_get_hw_max_freq(cpu);
+	u64 cur_freq;
+	unsigned long scale;
+
+	cur_freq = (u64)readl_relaxed(base + cpu * PER_CPU_OFFSET
+			+ REG_CUR_PERF_STATE_OFFSET);
+
+	cur_freq <<= SCHED_CAPACITY_SHIFT;
+	scale = (unsigned long)div_u64(cur_freq, max_freq);
+	scale = min(scale, SCHED_CAPACITY_SCALE);
+
+	this_cpu_write(arch_freq_scale, scale);
+}
+
+static struct scale_freq_data virt_sfd = {
+	.source = SCALE_FREQ_SOURCE_VIRT,
+	.set_freq_scale = virt_scale_freq_tick,
+};
+
+static unsigned int virt_cpufreq_set_perf(struct cpufreq_policy *policy,
+					  unsigned int target_freq)
+{
+	writel_relaxed(target_freq,
+		       base + policy->cpu * PER_CPU_OFFSET + REG_SET_PERF_STATE_OFFSET);
+	return 0;
+}
+
+static unsigned int virt_cpufreq_fast_switch(struct cpufreq_policy *policy,
+					     unsigned int target_freq)
+{
+	virt_cpufreq_set_perf(policy, target_freq);
+	return target_freq;
+}
+
+static u32 virt_cpufreq_get_perftbl_entry(int cpu, u32 idx)
+{
+	writel_relaxed(idx, base + cpu * PER_CPU_OFFSET
+		       + REG_PERFTBL_SEL_OFFSET);
+	return readl_relaxed(base + cpu * PER_CPU_OFFSET
+			     + REG_PERFTBL_RD_OFFSET);
+}
+
+static int virt_cpufreq_target(struct cpufreq_policy *policy,
+			       unsigned int target_freq,
+			       unsigned int relation)
+{
+	struct cpufreq_freqs freqs;
+	int ret = 0;
+
+	freqs.old = policy->cur;
+	freqs.new = target_freq;
+
+	cpufreq_freq_transition_begin(policy, &freqs);
+	ret = virt_cpufreq_set_perf(policy, target_freq);
+	cpufreq_freq_transition_end(policy, &freqs, ret != 0);
+
+	return ret;
+}
+
+static int virt_cpufreq_get_sharing_cpus(struct cpufreq_policy *policy)
+{
+	u32 cur_perf_domain, perf_domain;
+	struct device *cpu_dev;
+	int cpu;
+
+	cur_perf_domain = readl_relaxed(base + policy->cpu *
+					PER_CPU_OFFSET + REG_PERF_DOMAIN_OFFSET);
+
+	for_each_possible_cpu(cpu) {
+		cpu_dev = get_cpu_device(cpu);
+		if (!cpu_dev)
+			continue;
+
+		perf_domain = readl_relaxed(base + cpu *
+					    PER_CPU_OFFSET + REG_PERF_DOMAIN_OFFSET);
+
+		if (perf_domain == cur_perf_domain)
+			cpumask_set_cpu(cpu, policy->cpus);
+	}
+
+	return 0;
+}
+
+static int virt_cpufreq_get_freq_info(struct cpufreq_policy *policy)
+{
+	struct cpufreq_frequency_table *table;
+	u32 num_perftbl_entries, idx;
+
+	num_perftbl_entries = per_cpu(perftbl_num_entries, policy->cpu);
+
+	if (num_perftbl_entries == 1) {
+		policy->cpuinfo.min_freq = 1;
+		policy->cpuinfo.max_freq = virt_cpufreq_get_perftbl_entry(policy->cpu, 0);
+
+		policy->min = policy->cpuinfo.min_freq;
+		policy->max = policy->cpuinfo.max_freq;
+
+		policy->cur = policy->max;
+		return 0;
+	}
+
+	table = kcalloc(num_perftbl_entries + 1, sizeof(*table), GFP_KERNEL);
+	if (!table)
+		return -ENOMEM;
+
+	for (idx = 0; idx < num_perftbl_entries; idx++)
+		table[idx].frequency = virt_cpufreq_get_perftbl_entry(policy->cpu, idx);
+
+	table[idx].frequency = CPUFREQ_TABLE_END;
+	policy->freq_table = table;
+
+	return 0;
+}
+
+static int virt_cpufreq_cpu_init(struct cpufreq_policy *policy)
+{
+	struct device *cpu_dev;
+	int ret;
+
+	cpu_dev = get_cpu_device(policy->cpu);
+	if (!cpu_dev)
+		return -ENODEV;
+
+	ret = virt_cpufreq_get_freq_info(policy);
+	if (ret) {
+		dev_warn(cpu_dev, "failed to get cpufreq info\n");
+		return ret;
+	}
+
+	ret = virt_cpufreq_get_sharing_cpus(policy);
+	if (ret) {
+		dev_warn(cpu_dev, "failed to get sharing cpumask\n");
+		return ret;
+	}
+
+	/*
+	 * To simplify and improve latency of handling frequency requests on
+	 * the host side, this ensures that the vCPU thread triggering the MMIO
+	 * abort is the same thread whose performance constraints (Ex. uclamp
+	 * settings) need to be updated. This simplifies the VMM (Virtual
+	 * Machine Manager) having to find the correct vCPU thread and/or
+	 * facing permission issues when configuring other threads.
+	 */
+	policy->dvfs_possible_from_any_cpu = false;
+	policy->fast_switch_possible = true;
+
+	/*
+	 * Using the default SCALE_FREQ_SOURCE_CPUFREQ is insufficient since
+	 * the actual physical CPU frequency may not match requested frequency
+	 * from the vCPU thread due to frequency update latencies or other
+	 * inputs to the physical CPU frequency selection. This additional FIE
+	 * source allows for more accurate freq_scale updates and only takes
+	 * effect if another FIE source such as AMUs have not been registered.
+	 */
+	topology_set_scale_freq_source(&virt_sfd, policy->cpus);
+
+	return 0;
+}
+
+static int virt_cpufreq_cpu_exit(struct cpufreq_policy *policy)
+{
+	topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_VIRT, policy->related_cpus);
+	kfree(policy->freq_table);
+	return 0;
+}
+
+static int virt_cpufreq_online(struct cpufreq_policy *policy)
+{
+	/* Nothing to restore. */
+	return 0;
+}
+
+static int virt_cpufreq_offline(struct cpufreq_policy *policy)
+{
+	/* Dummy offline() to avoid exit() being called and freeing resources. */
+	return 0;
+}
+
+static int virt_cpufreq_verify_policy(struct cpufreq_policy_data *policy)
+{
+	if (policy->freq_table)
+		return cpufreq_frequency_table_verify(policy, policy->freq_table);
+
+	cpufreq_verify_within_cpu_limits(policy);
+	return 0;
+}
+
+static struct cpufreq_driver cpufreq_virt_driver = {
+	.name = "virt-cpufreq",
+	.init = virt_cpufreq_cpu_init,
+	.exit = virt_cpufreq_cpu_exit,
+	.online = virt_cpufreq_online,
+	.offline = virt_cpufreq_offline,
+	.verify = virt_cpufreq_verify_policy,
+	.target = virt_cpufreq_target,
+	.fast_switch = virt_cpufreq_fast_switch,
+	.attr = cpufreq_generic_attr,
+};
+
+static int virt_cpufreq_driver_probe(struct platform_device *pdev)
+{
+	u32 num_perftbl_entries;
+	int ret, cpu;
+
+	base = devm_platform_ioremap_resource(pdev, 0);
+	if (IS_ERR(base))
+		return PTR_ERR(base);
+
+	for_each_possible_cpu(cpu) {
+		num_perftbl_entries = readl_relaxed(base + cpu * PER_CPU_OFFSET
+						    + REG_PERFTBL_LEN_OFFSET);
+
+		if (!num_perftbl_entries || num_perftbl_entries > PERFTBL_MAX_ENTRIES)
+			return -ENODEV;
+
+		per_cpu(perftbl_num_entries, cpu) = num_perftbl_entries;
+	}
+
+	ret = cpufreq_register_driver(&cpufreq_virt_driver);
+	if (ret) {
+		dev_err(&pdev->dev, "Virtual CPUFreq driver failed to register: %d\n", ret);
+		return ret;
+	}
+
+	dev_dbg(&pdev->dev, "Virtual CPUFreq driver initialized\n");
+	return 0;
+}
+
+static int virt_cpufreq_driver_remove(struct platform_device *pdev)
+{
+	cpufreq_unregister_driver(&cpufreq_virt_driver);
+	return 0;
+}
+
+static const struct of_device_id virt_cpufreq_match[] = {
+	{ .compatible = "qemu,virtual-cpufreq", .data = NULL},
+	{}
+};
+MODULE_DEVICE_TABLE(of, virt_cpufreq_match);
+
+static struct platform_driver virt_cpufreq_driver = {
+	.probe = virt_cpufreq_driver_probe,
+	.remove = virt_cpufreq_driver_remove,
+	.driver = {
+		.name = "virt-cpufreq",
+		.of_match_table = virt_cpufreq_match,
+	},
+};
+
+static int __init virt_cpufreq_init(void)
+{
+	return platform_driver_register(&virt_cpufreq_driver);
+}
+postcore_initcall(virt_cpufreq_init);
+
+static void __exit virt_cpufreq_exit(void)
+{
+	platform_driver_unregister(&virt_cpufreq_driver);
+}
+module_exit(virt_cpufreq_exit);
+
+MODULE_DESCRIPTION("Virtual cpufreq driver");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index b721f360d759..d5d848849408 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -49,6 +49,7 @@ enum scale_freq_source {
 	SCALE_FREQ_SOURCE_CPUFREQ = 0,
 	SCALE_FREQ_SOURCE_ARCH,
 	SCALE_FREQ_SOURCE_CPPC,
+	SCALE_FREQ_SOURCE_VIRT,
 };
 
 struct scale_freq_data {
-- 
2.45.0.215.g3402c0e53f-goog
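The perftbl_len / perftbl_sel / perftbl_rd handshake described in the
driver's register comment can be exercised outside the kernel with a small
model. The sketch below is an illustration under stated assumptions:
struct mock_window, mock_writel() and mock_readl() are hypothetical
stand-ins for a VMM-emulated register window and are not part of the patch;
only the register offsets and the 64-entry limit come from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REG_PERFTBL_LEN_OFFSET 0x8
#define REG_PERFTBL_SEL_OFFSET 0xc
#define REG_PERFTBL_RD_OFFSET  0x10
#define PERFTBL_MAX_ENTRIES    64U

/* Hypothetical model of one register window as a VMM might emulate it. */
struct mock_window {
	uint32_t table[PERFTBL_MAX_ENTRIES];
	uint32_t len;	/* value returned by perftbl_len */
	uint32_t sel;	/* last value written to perftbl_sel */
};

static void mock_writel(struct mock_window *w, uint32_t val, uint32_t off)
{
	if (off == REG_PERFTBL_SEL_OFFSET)
		w->sel = val;
}

static uint32_t mock_readl(const struct mock_window *w, uint32_t off)
{
	switch (off) {
	case REG_PERFTBL_LEN_OFFSET:
		return w->len;
	case REG_PERFTBL_RD_OFFSET:
		/* Out-of-range selections read as 0 in this model. */
		if (w->sel < w->len && w->sel < PERFTBL_MAX_ENTRIES)
			return w->table[w->sel];
		return 0;
	default:
		return 0;
	}
}

/*
 * Guest side: validate the length as the probe routine does, then select
 * and read each entry in turn. Returns the number of entries copied into
 * out[], or 0 if the length is invalid or does not fit in cap.
 */
static uint32_t read_perf_table(struct mock_window *w, uint32_t *out, size_t cap)
{
	uint32_t len = mock_readl(w, REG_PERFTBL_LEN_OFFSET);
	uint32_t idx;

	assert(out != NULL);
	if (!len || len > PERFTBL_MAX_ENTRIES || len > cap)
		return 0;

	for (idx = 0; idx < len; idx++) {
		mock_writel(w, idx, REG_PERFTBL_SEL_OFFSET);
		out[idx] = mock_readl(w, REG_PERFTBL_RD_OFFSET);
	}
	return len;
}
```

Note that the model does not special-case a length of 1; in the driver a
single entry means there is no table and the entry holds the maximum perf
value.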
* Re: [PATCH v6 2/2] cpufreq: add virtual-cpufreq driver
From: David Dai @ 2024-06-27 21:22 UTC
To: Rafael J. Wysocki, Viresh Kumar, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Sudeep Holla, David Dai, Saravana Kannan
Cc: Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
Vincent Guittot, Marc Zyngier, Oliver Upton, Dietmar Eggemann,
Pavan Kondeti, Gupta Pankaj, Mel Gorman, kernel-team, linux-pm,
devicetree, linux-kernel

Hi folks,

Gentle nudge on this patch to see if there's any remaining concerns?

Thanks,
David

On Mon, May 20, 2024 at 10:32 PM David Dai <davidai@google.com> wrote:
>
> Introduce a virtualized cpufreq driver for guest kernels to improve
> performance and power of workloads within VMs.
>
> This driver does two main things:
>
> 1. Sends the frequency of vCPUs as a hint to the host. The host uses the
> hint to schedule the vCPU threads and decide physical CPU frequency.
>
> 2. If a VM does not support a virtualized FIE(like AMUs), it queries the
> host CPU frequency by reading a MMIO region of a virtual cpufreq device
> to update the guest's frequency scaling factor periodically. This enables
> accurate Per-Entity Load Tracking for tasks running in the guest.
>
> Co-developed-by: Saravana Kannan <saravanak@google.com>
> Signed-off-by: Saravana Kannan <saravanak@google.com>
> Signed-off-by: David Dai <davidai@google.com>
> ---
>  drivers/cpufreq/Kconfig           |  14 ++
>  drivers/cpufreq/Makefile          |   1 +
>  drivers/cpufreq/virtual-cpufreq.c | 335 ++++++++++++++++++++++++++++++
>  include/linux/arch_topology.h     |   1 +
>  4 files changed, 351 insertions(+)
>  create mode 100644 drivers/cpufreq/virtual-cpufreq.c
>
> diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
> index 94e55c40970a..9aa86bec5793 100644
> --- a/drivers/cpufreq/Kconfig
> +++ b/drivers/cpufreq/Kconfig
> @@ -217,6 +217,20 @@ config CPUFREQ_DT
>
>  	  If in doubt, say N.
>
> +config CPUFREQ_VIRT
> +	tristate "Virtual cpufreq driver"
> +	depends on OF && GENERIC_ARCH_TOPOLOGY
> +	help
> +	  This adds a virtualized cpufreq driver for guest kernels that
> +	  read/writes to a MMIO region for a virtualized cpufreq device to
> +	  communicate with the host. It sends performance requests to the host
> +	  which gets used as a hint to schedule vCPU threads and select CPU
> +	  frequency. If a VM does not support a virtualized FIE such as AMUs,
> +	  it updates the frequency scaling factor by polling host CPU frequency
> +	  to enable accurate Per-Entity Load Tracking for tasks running in the guest.
> +
> +	  If in doubt, say N.
> +
>  config CPUFREQ_DT_PLATDEV
>  	tristate "Generic DT based cpufreq platdev driver"
>  	depends on OF
> diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
> index 8d141c71b016..eb72ecdc24db 100644
> --- a/drivers/cpufreq/Makefile
> +++ b/drivers/cpufreq/Makefile
> @@ -16,6 +16,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_ATTR_SET)	+= cpufreq_governor_attr_set.o
>
>  obj-$(CONFIG_CPUFREQ_DT)		+= cpufreq-dt.o
>  obj-$(CONFIG_CPUFREQ_DT_PLATDEV)	+= cpufreq-dt-platdev.o
> +obj-$(CONFIG_CPUFREQ_VIRT)		+= virtual-cpufreq.o
>
>  # Traces
>  CFLAGS_amd-pstate-trace.o               := -I$(src)
> diff --git a/drivers/cpufreq/virtual-cpufreq.c b/drivers/cpufreq/virtual-cpufreq.c
> new file mode 100644
> index 000000000000..59ce2bda3913
> --- /dev/null
> +++ b/drivers/cpufreq/virtual-cpufreq.c
> @@ -0,0 +1,335 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2024 Google LLC
> + */
> +
> +#include <linux/arch_topology.h>
> +#include <linux/cpufreq.h>
> +#include <linux/init.h>
> +#include <linux/sched.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/of_address.h>
> +#include <linux/of_platform.h>
> +#include <linux/platform_device.h>
> +#include <linux/slab.h>
> +
> +/*
> + * CPU0..CPUn
> + * +-------------+-------------------------------+--------+-------+
> + * | Register    | Description                   | Offset | Len   |
> + * +-------------+-------------------------------+--------+-------+
> + * | cur_perf    | read this register to get     | 0x0    | 0x4   |
> + * |             | the current perf (integer val |        |       |
> + * |             | representing perf relative to |        |       |
> + * |             | max performance)              |        |       |
> + * |             | that vCPU is running at       |        |       |
> + * +-------------+-------------------------------+--------+-------+
> + * | set_perf    | write to this register to set | 0x4    | 0x4   |
> + * |             | perf value of the vCPU        |        |       |
> + * +-------------+-------------------------------+--------+-------+
> + * | perftbl_len | number of entries in perf     | 0x8    | 0x4   |
> + * |             | table. A single entry in the  |        |       |
> + * |             | perf table denotes no table   |        |       |
> + * |             | and the entry contains        |        |       |
> + * |             | the maximum perf value        |        |       |
> + * |             | that this vCPU supports.      |        |       |
> + * |             | The guest can request any     |        |       |
> + * |             | value between 1 and max perf  |        |       |
> + * |             | when perftbls are not used.   |        |       |
> + * +---------------------------------------------+--------+-------+
> + * | perftbl_sel | write to this register to     | 0xc    | 0x4   |
> + * |             | select perf table entry to    |        |       |
> + * |             | read from                     |        |       |
> + * +---------------------------------------------+--------+-------+
> + * | perftbl_rd  | read this register to get     | 0x10   | 0x4   |
> + * |             | perf value of the selected    |        |       |
> + * |             | entry based on perftbl_sel    |        |       |
> + * +---------------------------------------------+--------+-------+
> + * | perf_domain | performance domain number     | 0x14   | 0x4   |
> + * |             | that this vCPU belongs to.    |        |       |
> + * |             | vCPUs sharing the same perf   |        |       |
> + * |             | domain number are part of the |        |       |
> + * |             | same performance domain.
| | | > + * +-------------+-------------------------------+--------+-------+ > + */ > + > +#define REG_CUR_PERF_STATE_OFFSET 0x0 > +#define REG_SET_PERF_STATE_OFFSET 0x4 > +#define REG_PERFTBL_LEN_OFFSET 0x8 > +#define REG_PERFTBL_SEL_OFFSET 0xc > +#define REG_PERFTBL_RD_OFFSET 0x10 > +#define REG_PERF_DOMAIN_OFFSET 0x14 > +#define PER_CPU_OFFSET 0x1000 > + > +#define PERFTBL_MAX_ENTRIES 64U > + > +static void __iomem *base; > +static DEFINE_PER_CPU(u32, perftbl_num_entries); > + > +static void virt_scale_freq_tick(void) > +{ > + int cpu = smp_processor_id(); > + u32 max_freq = (u32)cpufreq_get_hw_max_freq(cpu); > + u64 cur_freq; > + unsigned long scale; > + > + cur_freq = (u64)readl_relaxed(base + cpu * PER_CPU_OFFSET > + + REG_CUR_PERF_STATE_OFFSET); > + > + cur_freq <<= SCHED_CAPACITY_SHIFT; > + scale = (unsigned long)div_u64(cur_freq, max_freq); > + scale = min(scale, SCHED_CAPACITY_SCALE); > + > + this_cpu_write(arch_freq_scale, scale); > +} > + > +static struct scale_freq_data virt_sfd = { > + .source = SCALE_FREQ_SOURCE_VIRT, > + .set_freq_scale = virt_scale_freq_tick, > +}; > + > +static unsigned int virt_cpufreq_set_perf(struct cpufreq_policy *policy, > + unsigned int target_freq) > +{ > + writel_relaxed(target_freq, > + base + policy->cpu * PER_CPU_OFFSET + REG_SET_PERF_STATE_OFFSET); > + return 0; > +} > + > +static unsigned int virt_cpufreq_fast_switch(struct cpufreq_policy *policy, > + unsigned int target_freq) > +{ > + virt_cpufreq_set_perf(policy, target_freq); > + return target_freq; > +} > + > +static u32 virt_cpufreq_get_perftbl_entry(int cpu, u32 idx) > +{ > + writel_relaxed(idx, base + cpu * PER_CPU_OFFSET + > + REG_PERFTBL_SEL_OFFSET); > + return readl_relaxed(base + cpu * PER_CPU_OFFSET + > + REG_PERFTBL_RD_OFFSET); > +} > + > +static int virt_cpufreq_target(struct cpufreq_policy *policy, > + unsigned int target_freq, > + unsigned int relation) > +{ > + struct cpufreq_freqs freqs; > + int ret = 0; > + > + freqs.old = policy->cur; > + freqs.new 
= target_freq; > + > + cpufreq_freq_transition_begin(policy, &freqs); > + ret = virt_cpufreq_set_perf(policy, target_freq); > + cpufreq_freq_transition_end(policy, &freqs, ret != 0); > + > + return ret; > +} > + > +static int virt_cpufreq_get_sharing_cpus(struct cpufreq_policy *policy) > +{ > + u32 cur_perf_domain, perf_domain; > + struct device *cpu_dev; > + int cpu; > + > + cur_perf_domain = readl_relaxed(base + policy->cpu * > + PER_CPU_OFFSET + REG_PERF_DOMAIN_OFFSET); > + > + for_each_possible_cpu(cpu) { > + cpu_dev = get_cpu_device(cpu); > + if (!cpu_dev) > + continue; > + > + perf_domain = readl_relaxed(base + cpu * > + PER_CPU_OFFSET + REG_PERF_DOMAIN_OFFSET); > + > + if (perf_domain == cur_perf_domain) > + cpumask_set_cpu(cpu, policy->cpus); > + } > + > + return 0; > +} > + > +static int virt_cpufreq_get_freq_info(struct cpufreq_policy *policy) > +{ > + struct cpufreq_frequency_table *table; > + u32 num_perftbl_entries, idx; > + > + num_perftbl_entries = per_cpu(perftbl_num_entries, policy->cpu); > + > + if (num_perftbl_entries == 1) { > + policy->cpuinfo.min_freq = 1; > + policy->cpuinfo.max_freq = virt_cpufreq_get_perftbl_entry(policy->cpu, 0); > + > + policy->min = policy->cpuinfo.min_freq; > + policy->max = policy->cpuinfo.max_freq; > + > + policy->cur = policy->max; > + return 0; > + } > + > + table = kcalloc(num_perftbl_entries + 1, sizeof(*table), GFP_KERNEL); > + if (!table) > + return -ENOMEM; > + > + for (idx = 0; idx < num_perftbl_entries; idx++) > + table[idx].frequency = virt_cpufreq_get_perftbl_entry(policy->cpu, idx); > + > + table[idx].frequency = CPUFREQ_TABLE_END; > + policy->freq_table = table; > + > + return 0; > +} > + > +static int virt_cpufreq_cpu_init(struct cpufreq_policy *policy) > +{ > + struct device *cpu_dev; > + int ret; > + > + cpu_dev = get_cpu_device(policy->cpu); > + if (!cpu_dev) > + return -ENODEV; > + > + ret = virt_cpufreq_get_freq_info(policy); > + if (ret) { > + dev_warn(cpu_dev, "failed to get cpufreq info\n"); > + 
return ret; > + } > + > + ret = virt_cpufreq_get_sharing_cpus(policy); > + if (ret) { > + dev_warn(cpu_dev, "failed to get sharing cpumask\n"); > + return ret; > + } > + > + /* > + * To simplify and improve latency of handling frequency requests on > + * the host side, this ensures that the vCPU thread triggering the MMIO > + * abort is the same thread whose performance constraints (Ex. uclamp > + * settings) need to be updated. This simplifies the VMM (Virtual > + * Machine Manager) having to find the correct vCPU thread and/or > + * facing permission issues when configuring other threads. > + */ > + policy->dvfs_possible_from_any_cpu = false; > + policy->fast_switch_possible = true; > + > + /* > + * Using the default SCALE_FREQ_SOURCE_CPUFREQ is insufficient since > + * the actual physical CPU frequency may not match requested frequency > + * from the vCPU thread due to frequency update latencies or other > + * inputs to the physical CPU frequency selection. This additional FIE > + * source allows for more accurate freq_scale updates and only takes > + * effect if another FIE source such as AMUs have not been registered. > + */ > + topology_set_scale_freq_source(&virt_sfd, policy->cpus); > + > + return 0; > +} > + > +static int virt_cpufreq_cpu_exit(struct cpufreq_policy *policy) > +{ > + topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_VIRT, policy->related_cpus); > + kfree(policy->freq_table); > + return 0; > +} > + > +static int virt_cpufreq_online(struct cpufreq_policy *policy) > +{ > + /* Nothing to restore. */ > + return 0; > +} > + > +static int virt_cpufreq_offline(struct cpufreq_policy *policy) > +{ > + /* Dummy offline() to avoid exit() being called and freeing resources. 
*/ > + return 0; > +} > + > +static int virt_cpufreq_verify_policy(struct cpufreq_policy_data *policy) > +{ > + if (policy->freq_table) > + return cpufreq_frequency_table_verify(policy, policy->freq_table); > + > + cpufreq_verify_within_cpu_limits(policy); > + return 0; > +} > + > +static struct cpufreq_driver cpufreq_virt_driver = { > + .name = "virt-cpufreq", > + .init = virt_cpufreq_cpu_init, > + .exit = virt_cpufreq_cpu_exit, > + .online = virt_cpufreq_online, > + .offline = virt_cpufreq_offline, > + .verify = virt_cpufreq_verify_policy, > + .target = virt_cpufreq_target, > + .fast_switch = virt_cpufreq_fast_switch, > + .attr = cpufreq_generic_attr, > +}; > + > +static int virt_cpufreq_driver_probe(struct platform_device *pdev) > +{ > + u32 num_perftbl_entries; > + int ret, cpu; > + > + base = devm_platform_ioremap_resource(pdev, 0); > + if (IS_ERR(base)) > + return PTR_ERR(base); > + > + for_each_possible_cpu(cpu) { > + num_perftbl_entries = readl_relaxed(base + cpu * PER_CPU_OFFSET + > + REG_PERFTBL_LEN_OFFSET); > + > + if (!num_perftbl_entries || num_perftbl_entries > PERFTBL_MAX_ENTRIES) > + return -ENODEV; > + > + per_cpu(perftbl_num_entries, cpu) = num_perftbl_entries; > + } > + > + ret = cpufreq_register_driver(&cpufreq_virt_driver); > + if (ret) { > + dev_err(&pdev->dev, "Virtual CPUFreq driver failed to register: %d\n", ret); > + return ret; > + } > + > + dev_dbg(&pdev->dev, "Virtual CPUFreq driver initialized\n"); > + return 0; > +} > + > +static int virt_cpufreq_driver_remove(struct platform_device *pdev) > +{ > + cpufreq_unregister_driver(&cpufreq_virt_driver); > + return 0; > +} > + > +static const struct of_device_id virt_cpufreq_match[] = { > + { .compatible = "qemu,virtual-cpufreq", .data = NULL}, > + {} > +}; > +MODULE_DEVICE_TABLE(of, virt_cpufreq_match); > + > +static struct platform_driver virt_cpufreq_driver = { > + .probe = virt_cpufreq_driver_probe, > + .remove = virt_cpufreq_driver_remove, > + .driver = { > + .name = "virt-cpufreq", > + 
.of_match_table = virt_cpufreq_match, > + }, > +}; > + > +static int __init virt_cpufreq_init(void) > +{ > + return platform_driver_register(&virt_cpufreq_driver); > +} > +postcore_initcall(virt_cpufreq_init); > + > +static void __exit virt_cpufreq_exit(void) > +{ > + platform_driver_unregister(&virt_cpufreq_driver); > +} > +module_exit(virt_cpufreq_exit); > + > +MODULE_DESCRIPTION("Virtual cpufreq driver"); > +MODULE_LICENSE("GPL"); > diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h > index b721f360d759..d5d848849408 100644 > --- a/include/linux/arch_topology.h > +++ b/include/linux/arch_topology.h > @@ -49,6 +49,7 @@ enum scale_freq_source { > SCALE_FREQ_SOURCE_CPUFREQ = 0, > SCALE_FREQ_SOURCE_ARCH, > SCALE_FREQ_SOURCE_CPPC, > + SCALE_FREQ_SOURCE_VIRT, > }; > > struct scale_freq_data { > -- > 2.45.0.215.g3402c0e53f-goog > ^ permalink raw reply [flat|nested] 13+ messages in thread
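For readers following the quoted driver, the per-vCPU register addressing and the arithmetic in virt_scale_freq_tick() can be tried in user space. The sketch below is only an illustration, not part of the patch: the constants are copied from the quoted code, but vcpu_reg_offset() and freq_scale() are hypothetical helper names, the MMIO read is replaced by a plain argument, and the assert on max_perf is an addition that the driver itself does not have.

```c
#include <assert.h>
#include <stdint.h>

/* Register offsets and per-vCPU stride, copied from the quoted patch. */
#define REG_CUR_PERF_STATE_OFFSET 0x0
#define REG_SET_PERF_STATE_OFFSET 0x4
#define PER_CPU_OFFSET            0x1000

/* Same fixed-point scale the scheduler uses (1024 == full capacity). */
#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1UL << SCHED_CAPACITY_SHIFT)

/*
 * Hypothetical helper: byte offset of register `reg` for vCPU `cpu`
 * inside the MMIO region, i.e. the "cpu * PER_CPU_OFFSET + reg" part
 * of the address expressions in the driver.
 */
uint32_t vcpu_reg_offset(unsigned int cpu, uint32_t reg)
{
	return cpu * PER_CPU_OFFSET + reg;
}

/*
 * Hypothetical helper mirroring the arithmetic of virt_scale_freq_tick():
 * scale = min((cur_perf << SCHED_CAPACITY_SHIFT) / max_perf,
 *             SCHED_CAPACITY_SCALE)
 * cur_perf stands in for the value read from the cur_perf register.
 */
unsigned long freq_scale(uint32_t cur_perf, uint32_t max_perf)
{
	uint64_t cur = (uint64_t)cur_perf << SCHED_CAPACITY_SHIFT;
	unsigned long scale;

	/* Illustration-only guard; the driver has no such check. */
	assert(max_perf != 0);

	scale = (unsigned long)(cur / max_perf);
	return scale < SCHED_CAPACITY_SCALE ? scale : SCHED_CAPACITY_SCALE;
}
```

With these helpers, freq_scale(512, 1024) gives 512 (half capacity), any cur_perf at or above max_perf clamps to 1024, and vcpu_reg_offset(2, REG_SET_PERF_STATE_OFFSET) gives 0x2004.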
* Re: [PATCH v6 2/2] cpufreq: add virtual-cpufreq driver
2024-06-27 21:22 ` David Dai
@ 2024-06-28 12:01 ` Rafael J. Wysocki
2024-06-28 12:42 ` Sudeep Holla
2024-07-10 0:07 ` Saravana Kannan
0 siblings, 2 replies; 13+ messages in thread
From: Rafael J. Wysocki @ 2024-06-28 12:01 UTC (permalink / raw)
To: David Dai
Cc: Rafael J. Wysocki, Viresh Kumar, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Sudeep Holla, Saravana Kannan, Quentin Perret,
Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
Marc Zyngier, Oliver Upton, Dietmar Eggemann, Pavan Kondeti,
Gupta Pankaj, Mel Gorman, kernel-team, linux-pm, devicetree,
linux-kernel
Hi,
On Thu, Jun 27, 2024 at 11:22 PM David Dai <davidai@google.com> wrote:
>
> Hi folks,
>
> Gentle nudge on this patch to see if there's any remaining concerns?
Yes, there are.
The dependency of OF is pretty much a no-go from my perspective.
Thanks!
* Re: [PATCH v6 2/2] cpufreq: add virtual-cpufreq driver
2024-06-28 12:01 ` Rafael J. Wysocki
@ 2024-06-28 12:42 ` Sudeep Holla
2024-07-10 0:07 ` Saravana Kannan
1 sibling, 0 replies; 13+ messages in thread
From: Sudeep Holla @ 2024-06-28 12:42 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: David Dai, Viresh Kumar, Sudeep Holla, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Saravana Kannan, Quentin Perret,
Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
Marc Zyngier, Oliver Upton, Dietmar Eggemann, Pavan Kondeti,
Gupta Pankaj, Mel Gorman, kernel-team, linux-pm, devicetree,
linux-kernel
On Fri, Jun 28, 2024 at 02:01:16PM +0200, Rafael J. Wysocki wrote:
> Hi,
>
> On Thu, Jun 27, 2024 at 11:22 PM David Dai <davidai@google.com> wrote:
> >
> > Hi folks,
> >
> > Gentle nudge on this patch to see if there's any remaining concerns?
>
> Yes, there are.
>
> The dependency of OF is pretty much a no-go from my perspective.
>
I agree and I don't think it is needed as well, can be removed easily IMO.
--
Regards,
Sudeep
* Re: [PATCH v6 2/2] cpufreq: add virtual-cpufreq driver
2024-06-28 12:01 ` Rafael J. Wysocki
2024-06-28 12:42 ` Sudeep Holla
@ 2024-07-10 0:07 ` Saravana Kannan
2024-08-30 22:51 ` David Dai
1 sibling, 1 reply; 13+ messages in thread
From: Saravana Kannan @ 2024-07-10 0:07 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: David Dai, Viresh Kumar, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Sudeep Holla, Quentin Perret, Masami Hiramatsu,
Will Deacon, Peter Zijlstra, Vincent Guittot, Marc Zyngier,
Oliver Upton, Dietmar Eggemann, Pavan Kondeti, Gupta Pankaj,
Mel Gorman, kernel-team, linux-pm, devicetree, linux-kernel
On Fri, Jun 28, 2024 at 5:01 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> Hi,
>
> On Thu, Jun 27, 2024 at 11:22 PM David Dai <davidai@google.com> wrote:
> >
> > Hi folks,
> >
> > Gentle nudge on this patch to see if there's any remaining concerns?
>
> Yes, there are.
>
> The dependency of OF is pretty much a no-go from my perspective.
If you are talking about the "depends on OF" I think that can and
should be dropped. There's nothing really OF specific about this
driver except the probe function being for a platform device/driver.
We can easily write an ACPI driver variant of the probe function and
have the same driver work for both OF and ACPI versions of the device.
We don't have any ACPI experience or a test set up to test the ACPI
variant. That's why we haven't added it. We are more than happy to
accept tested patches on top of this that'll enable this for ACPI
variants too.
If we drop the "depends on OF", will you accept this patch? If not,
can you give some guidance on what you are looking for?
Thanks,
Saravana
>
> Thanks!
>
>
> > On Mon, May 20, 2024 at 10:32 PM David Dai <davidai@google.com> wrote:
> > >
> > > Introduce a virtualized cpufreq driver for guest kernels to improve
> > > performance and power of workloads within VMs.
> > >
> > > This driver does two main things:
> > >
> > > 1.
Sends the frequency of vCPUs as a hint to the host. The host uses the > > > hint to schedule the vCPU threads and decide physical CPU frequency. > > > > > > 2. If a VM does not support a virtualized FIE(like AMUs), it queries the > > > host CPU frequency by reading a MMIO region of a virtual cpufreq device > > > to update the guest's frequency scaling factor periodically. This enables > > > accurate Per-Entity Load Tracking for tasks running in the guest. > > > > > > Co-developed-by: Saravana Kannan <saravanak@google.com> > > > Signed-off-by: Saravana Kannan <saravanak@google.com> > > > Signed-off-by: David Dai <davidai@google.com> > > > --- > > > drivers/cpufreq/Kconfig | 14 ++ > > > drivers/cpufreq/Makefile | 1 + > > > drivers/cpufreq/virtual-cpufreq.c | 335 ++++++++++++++++++++++++++++++ > > > include/linux/arch_topology.h | 1 + > > > 4 files changed, 351 insertions(+) > > > create mode 100644 drivers/cpufreq/virtual-cpufreq.c > > > > > > diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig > > > index 94e55c40970a..9aa86bec5793 100644 > > > --- a/drivers/cpufreq/Kconfig > > > +++ b/drivers/cpufreq/Kconfig > > > @@ -217,6 +217,20 @@ config CPUFREQ_DT > > > > > > If in doubt, say N. > > > > > > +config CPUFREQ_VIRT > > > + tristate "Virtual cpufreq driver" > > > + depends on OF && GENERIC_ARCH_TOPOLOGY > > > + help > > > + This adds a virtualized cpufreq driver for guest kernels that > > > + read/writes to a MMIO region for a virtualized cpufreq device to > > > + communicate with the host. It sends performance requests to the host > > > + which gets used as a hint to schedule vCPU threads and select CPU > > > + frequency. If a VM does not support a virtualized FIE such as AMUs, > > > + it updates the frequency scaling factor by polling host CPU frequency > > > + to enable accurate Per-Entity Load Tracking for tasks running in the guest. > > > + > > > + If in doubt, say N. 
> > > + > > > config CPUFREQ_DT_PLATDEV > > > tristate "Generic DT based cpufreq platdev driver" > > > depends on OF > > > diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile > > > index 8d141c71b016..eb72ecdc24db 100644 > > > --- a/drivers/cpufreq/Makefile > > > +++ b/drivers/cpufreq/Makefile > > > @@ -16,6 +16,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_ATTR_SET) += cpufreq_governor_attr_set.o > > > > > > obj-$(CONFIG_CPUFREQ_DT) += cpufreq-dt.o > > > obj-$(CONFIG_CPUFREQ_DT_PLATDEV) += cpufreq-dt-platdev.o > > > +obj-$(CONFIG_CPUFREQ_VIRT) += virtual-cpufreq.o > > > > > > # Traces > > > CFLAGS_amd-pstate-trace.o := -I$(src) > > > diff --git a/drivers/cpufreq/virtual-cpufreq.c b/drivers/cpufreq/virtual-cpufreq.c > > > new file mode 100644 > > > index 000000000000..59ce2bda3913 > > > --- /dev/null > > > +++ b/drivers/cpufreq/virtual-cpufreq.c > > > @@ -0,0 +1,335 @@ > > > +// SPDX-License-Identifier: GPL-2.0-only > > > +/* > > > + * Copyright (C) 2024 Google LLC > > > + */ > > > + > > > +#include <linux/arch_topology.h> > > > +#include <linux/cpufreq.h> > > > +#include <linux/init.h> > > > +#include <linux/sched.h> > > > +#include <linux/kernel.h> > > > +#include <linux/module.h> > > > +#include <linux/of_address.h> > > > +#include <linux/of_platform.h> > > > +#include <linux/platform_device.h> > > > +#include <linux/slab.h> > > > + > > > +/* > > > + * CPU0..CPUn > > > + * +-------------+-------------------------------+--------+-------+ > > > + * | Register | Description | Offset | Len | > > > + * +-------------+-------------------------------+--------+-------+ > > > + * | cur_perf | read this register to get | 0x0 | 0x4 | > > > + * | | the current perf (integer val | | | > > > + * | | representing perf relative to | | | > > > + * | | max performance) | | | > > > + * | | that vCPU is running at | | | > > > + * +-------------+-------------------------------+--------+-------+ > > > + * | set_perf | write to this register to set | 0x4 | 0x4 | > > > + * | | perf value 
of the vCPU | | | > > > + * +-------------+-------------------------------+--------+-------+ > > > + * | perftbl_len | number of entries in perf | 0x8 | 0x4 | > > > + * | | table. A single entry in the | | | > > > + * | | perf table denotes no table | | | > > > + * | | and the entry contains | | | > > > + * | | the maximum perf value | | | > > > + * | | that this vCPU supports. | | | > > > + * | | The guest can request any | | | > > > + * | | value between 1 and max perf | | | > > > + * | | when perftbls are not used. | | | > > > + * +---------------------------------------------+--------+-------+ > > > + * | perftbl_sel | write to this register to | 0xc | 0x4 | > > > + * | | select perf table entry to | | | > > > + * | | read from | | | > > > + * +---------------------------------------------+--------+-------+ > > > + * | perftbl_rd | read this register to get | 0x10 | 0x4 | > > > + * | | perf value of the selected | | | > > > + * | | entry based on perftbl_sel | | | > > > + * +---------------------------------------------+--------+-------+ > > > + * | perf_domain | performance domain number | 0x14 | 0x4 | > > > + * | | that this vCPU belongs to. | | | > > > + * | | vCPUs sharing the same perf | | | > > > + * | | domain number are part of the | | | > > > + * | | same performance domain. 
| | | > > > + * +-------------+-------------------------------+--------+-------+ > > > + */ > > > + > > > +#define REG_CUR_PERF_STATE_OFFSET 0x0 > > > +#define REG_SET_PERF_STATE_OFFSET 0x4 > > > +#define REG_PERFTBL_LEN_OFFSET 0x8 > > > +#define REG_PERFTBL_SEL_OFFSET 0xc > > > +#define REG_PERFTBL_RD_OFFSET 0x10 > > > +#define REG_PERF_DOMAIN_OFFSET 0x14 > > > +#define PER_CPU_OFFSET 0x1000 > > > + > > > +#define PERFTBL_MAX_ENTRIES 64U > > > + > > > +static void __iomem *base; > > > +static DEFINE_PER_CPU(u32, perftbl_num_entries); > > > + > > > +static void virt_scale_freq_tick(void) > > > +{ > > > + int cpu = smp_processor_id(); > > > + u32 max_freq = (u32)cpufreq_get_hw_max_freq(cpu); > > > + u64 cur_freq; > > > + unsigned long scale; > > > + > > > + cur_freq = (u64)readl_relaxed(base + cpu * PER_CPU_OFFSET > > > + + REG_CUR_PERF_STATE_OFFSET); > > > + > > > + cur_freq <<= SCHED_CAPACITY_SHIFT; > > > + scale = (unsigned long)div_u64(cur_freq, max_freq); > > > + scale = min(scale, SCHED_CAPACITY_SCALE); > > > + > > > + this_cpu_write(arch_freq_scale, scale); > > > +} > > > + > > > +static struct scale_freq_data virt_sfd = { > > > + .source = SCALE_FREQ_SOURCE_VIRT, > > > + .set_freq_scale = virt_scale_freq_tick, > > > +}; > > > + > > > +static unsigned int virt_cpufreq_set_perf(struct cpufreq_policy *policy, > > > + unsigned int target_freq) > > > +{ > > > + writel_relaxed(target_freq, > > > + base + policy->cpu * PER_CPU_OFFSET + REG_SET_PERF_STATE_OFFSET); > > > + return 0; > > > +} > > > + > > > +static unsigned int virt_cpufreq_fast_switch(struct cpufreq_policy *policy, > > > + unsigned int target_freq) > > > +{ > > > + virt_cpufreq_set_perf(policy, target_freq); > > > + return target_freq; > > > +} > > > + > > > +static u32 virt_cpufreq_get_perftbl_entry(int cpu, u32 idx) > > > +{ > > > + writel_relaxed(idx, base + cpu * PER_CPU_OFFSET + > > > + REG_PERFTBL_SEL_OFFSET); > > > + return readl_relaxed(base + cpu * PER_CPU_OFFSET + > > > + 
REG_PERFTBL_RD_OFFSET); > > > +} > > > + > > > +static int virt_cpufreq_target(struct cpufreq_policy *policy, > > > + unsigned int target_freq, > > > + unsigned int relation) > > > +{ > > > + struct cpufreq_freqs freqs; > > > + int ret = 0; > > > + > > > + freqs.old = policy->cur; > > > + freqs.new = target_freq; > > > + > > > + cpufreq_freq_transition_begin(policy, &freqs); > > > + ret = virt_cpufreq_set_perf(policy, target_freq); > > > + cpufreq_freq_transition_end(policy, &freqs, ret != 0); > > > + > > > + return ret; > > > +} > > > + > > > +static int virt_cpufreq_get_sharing_cpus(struct cpufreq_policy *policy) > > > +{ > > > + u32 cur_perf_domain, perf_domain; > > > + struct device *cpu_dev; > > > + int cpu; > > > + > > > + cur_perf_domain = readl_relaxed(base + policy->cpu * > > > + PER_CPU_OFFSET + REG_PERF_DOMAIN_OFFSET); > > > + > > > + for_each_possible_cpu(cpu) { > > > + cpu_dev = get_cpu_device(cpu); > > > + if (!cpu_dev) > > > + continue; > > > + > > > + perf_domain = readl_relaxed(base + cpu * > > > + PER_CPU_OFFSET + REG_PERF_DOMAIN_OFFSET); > > > + > > > + if (perf_domain == cur_perf_domain) > > > + cpumask_set_cpu(cpu, policy->cpus); > > > + } > > > + > > > + return 0; > > > +} > > > + > > > +static int virt_cpufreq_get_freq_info(struct cpufreq_policy *policy) > > > +{ > > > + struct cpufreq_frequency_table *table; > > > + u32 num_perftbl_entries, idx; > > > + > > > + num_perftbl_entries = per_cpu(perftbl_num_entries, policy->cpu); > > > + > > > + if (num_perftbl_entries == 1) { > > > + policy->cpuinfo.min_freq = 1; > > > + policy->cpuinfo.max_freq = virt_cpufreq_get_perftbl_entry(policy->cpu, 0); > > > + > > > + policy->min = policy->cpuinfo.min_freq; > > > + policy->max = policy->cpuinfo.max_freq; > > > + > > > + policy->cur = policy->max; > > > + return 0; > > > + } > > > + > > > + table = kcalloc(num_perftbl_entries + 1, sizeof(*table), GFP_KERNEL); > > > + if (!table) > > > + return -ENOMEM; > > > + > > > + for (idx = 0; idx < 
num_perftbl_entries; idx++) > > > + table[idx].frequency = virt_cpufreq_get_perftbl_entry(policy->cpu, idx); > > > + > > > + table[idx].frequency = CPUFREQ_TABLE_END; > > > + policy->freq_table = table; > > > + > > > + return 0; > > > +} > > > + > > > +static int virt_cpufreq_cpu_init(struct cpufreq_policy *policy) > > > +{ > > > + struct device *cpu_dev; > > > + int ret; > > > + > > > + cpu_dev = get_cpu_device(policy->cpu); > > > + if (!cpu_dev) > > > + return -ENODEV; > > > + > > > + ret = virt_cpufreq_get_freq_info(policy); > > > + if (ret) { > > > + dev_warn(cpu_dev, "failed to get cpufreq info\n"); > > > + return ret; > > > + } > > > + > > > + ret = virt_cpufreq_get_sharing_cpus(policy); > > > + if (ret) { > > > + dev_warn(cpu_dev, "failed to get sharing cpumask\n"); > > > + return ret; > > > + } > > > + > > > + /* > > > + * To simplify and improve latency of handling frequency requests on > > > + * the host side, this ensures that the vCPU thread triggering the MMIO > > > + * abort is the same thread whose performance constraints (Ex. uclamp > > > + * settings) need to be updated. This simplifies the VMM (Virtual > > > + * Machine Manager) having to find the correct vCPU thread and/or > > > + * facing permission issues when configuring other threads. > > > + */ > > > + policy->dvfs_possible_from_any_cpu = false; > > > + policy->fast_switch_possible = true; > > > + > > > + /* > > > + * Using the default SCALE_FREQ_SOURCE_CPUFREQ is insufficient since > > > + * the actual physical CPU frequency may not match requested frequency > > > + * from the vCPU thread due to frequency update latencies or other > > > + * inputs to the physical CPU frequency selection. This additional FIE > > > + * source allows for more accurate freq_scale updates and only takes > > > + * effect if another FIE source such as AMUs have not been registered. 
> > > + */ > > > + topology_set_scale_freq_source(&virt_sfd, policy->cpus); > > > + > > > + return 0; > > > +} > > > + > > > +static int virt_cpufreq_cpu_exit(struct cpufreq_policy *policy) > > > +{ > > > + topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_VIRT, policy->related_cpus); > > > + kfree(policy->freq_table); > > > + return 0; > > > +} > > > + > > > +static int virt_cpufreq_online(struct cpufreq_policy *policy) > > > +{ > > > + /* Nothing to restore. */ > > > + return 0; > > > +} > > > + > > > +static int virt_cpufreq_offline(struct cpufreq_policy *policy) > > > +{ > > > + /* Dummy offline() to avoid exit() being called and freeing resources. */ > > > + return 0; > > > +} > > > + > > > +static int virt_cpufreq_verify_policy(struct cpufreq_policy_data *policy) > > > +{ > > > + if (policy->freq_table) > > > + return cpufreq_frequency_table_verify(policy, policy->freq_table); > > > + > > > + cpufreq_verify_within_cpu_limits(policy); > > > + return 0; > > > +} > > > + > > > +static struct cpufreq_driver cpufreq_virt_driver = { > > > + .name = "virt-cpufreq", > > > + .init = virt_cpufreq_cpu_init, > > > + .exit = virt_cpufreq_cpu_exit, > > > + .online = virt_cpufreq_online, > > > + .offline = virt_cpufreq_offline, > > > + .verify = virt_cpufreq_verify_policy, > > > + .target = virt_cpufreq_target, > > > + .fast_switch = virt_cpufreq_fast_switch, > > > + .attr = cpufreq_generic_attr, > > > +}; > > > + > > > +static int virt_cpufreq_driver_probe(struct platform_device *pdev) > > > +{ > > > + u32 num_perftbl_entries; > > > + int ret, cpu; > > > + > > > + base = devm_platform_ioremap_resource(pdev, 0); > > > + if (IS_ERR(base)) > > > + return PTR_ERR(base); > > > + > > > + for_each_possible_cpu(cpu) { > > > + num_perftbl_entries = readl_relaxed(base + cpu * PER_CPU_OFFSET + > > > + REG_PERFTBL_LEN_OFFSET); > > > + > > > + if (!num_perftbl_entries || num_perftbl_entries > PERFTBL_MAX_ENTRIES) > > > + return -ENODEV; > > > + > > > + per_cpu(perftbl_num_entries, 
cpu) = num_perftbl_entries; > > > + } > > > + > > > + ret = cpufreq_register_driver(&cpufreq_virt_driver); > > > + if (ret) { > > > + dev_err(&pdev->dev, "Virtual CPUFreq driver failed to register: %d\n", ret); > > > + return ret; > > > + } > > > + > > > + dev_dbg(&pdev->dev, "Virtual CPUFreq driver initialized\n"); > > > + return 0; > > > +} > > > + > > > +static int virt_cpufreq_driver_remove(struct platform_device *pdev) > > > +{ > > > + cpufreq_unregister_driver(&cpufreq_virt_driver); > > > + return 0; > > > +} > > > + > > > +static const struct of_device_id virt_cpufreq_match[] = { > > > + { .compatible = "qemu,virtual-cpufreq", .data = NULL}, > > > + {} > > > +}; > > > +MODULE_DEVICE_TABLE(of, virt_cpufreq_match); > > > + > > > +static struct platform_driver virt_cpufreq_driver = { > > > + .probe = virt_cpufreq_driver_probe, > > > + .remove = virt_cpufreq_driver_remove, > > > + .driver = { > > > + .name = "virt-cpufreq", > > > + .of_match_table = virt_cpufreq_match, > > > + }, > > > +}; > > > + > > > +static int __init virt_cpufreq_init(void) > > > +{ > > > + return platform_driver_register(&virt_cpufreq_driver); > > > +} > > > +postcore_initcall(virt_cpufreq_init); > > > + > > > +static void __exit virt_cpufreq_exit(void) > > > +{ > > > + platform_driver_unregister(&virt_cpufreq_driver); > > > +} > > > +module_exit(virt_cpufreq_exit); > > > + > > > +MODULE_DESCRIPTION("Virtual cpufreq driver"); > > > +MODULE_LICENSE("GPL"); > > > diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h > > > index b721f360d759..d5d848849408 100644 > > > --- a/include/linux/arch_topology.h > > > +++ b/include/linux/arch_topology.h > > > @@ -49,6 +49,7 @@ enum scale_freq_source { > > > SCALE_FREQ_SOURCE_CPUFREQ = 0, > > > SCALE_FREQ_SOURCE_ARCH, > > > SCALE_FREQ_SOURCE_CPPC, > > > + SCALE_FREQ_SOURCE_VIRT, > > > }; > > > > > > struct scale_freq_data { > > > -- > > > 2.45.0.215.g3402c0e53f-goog > > > > > ^ permalink raw reply [flat|nested] 13+ messages in 
thread
* Re: [PATCH v6 2/2] cpufreq: add virtual-cpufreq driver
2024-07-10 0:07 ` Saravana Kannan
@ 2024-08-30 22:51 ` David Dai
0 siblings, 0 replies; 13+ messages in thread
From: David Dai @ 2024-08-30 22:51 UTC (permalink / raw)
To: Saravana Kannan
Cc: Rafael J. Wysocki, Viresh Kumar, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Sudeep Holla, Quentin Perret, Masami Hiramatsu,
Will Deacon, Peter Zijlstra, Vincent Guittot, Marc Zyngier,
Oliver Upton, Dietmar Eggemann, Pavan Kondeti, Gupta Pankaj,
Mel Gorman, kernel-team, linux-pm, devicetree, linux-kernel
On Tue, Jul 9, 2024 at 5:08 PM Saravana Kannan <saravanak@google.com> wrote:
>
> On Fri, Jun 28, 2024 at 5:01 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > Hi,
> >
> > On Thu, Jun 27, 2024 at 11:22 PM David Dai <davidai@google.com> wrote:
> > >
> > > Hi folks,
> > >
> > > Gentle nudge on this patch to see if there's any remaining concerns?
> >
> > Yes, there are.
> >
> > The dependency of OF is pretty much a no-go from my perspective.
>
> If you are talking about the "depends on OF" I think that can and
> should be dropped.
>
> There's nothing really OF specific about this driver except the probe
> function being for a platform device/driver. We can easily write an
> ACPI driver variant of the probe function and have the same driver
> work for both OF and ACPI versions of the device. We don't have any
> ACPI experience or a test set up to test the ACPI variant. That's why
> we haven't added it. We are more than happy to accept tested patches
> on top of this that'll enable this for ACPI variants too.
>
> If we drop the "depends on OF", will you accept this patch? If not,
> can you give some guidance on what you are looking for?
Hi Rafael,
Just wanted to follow up and see what your thoughts are on this?
Thanks,
David
>
> Thanks,
> Saravana
* Re: [PATCH v6 2/2] cpufreq: add virtual-cpufreq driver
2024-05-21 4:30 ` [PATCH v6 2/2] cpufreq: add virtual-cpufreq driver David Dai
2024-06-27 21:22 ` David Dai
@ 2024-06-28 12:51 ` Sudeep Holla
2024-07-10 9:05 ` Marc Zyngier
1 sibling, 1 reply; 13+ messages in thread
From: Sudeep Holla @ 2024-06-28 12:51 UTC (permalink / raw)
To: David Dai
Cc: Rafael J. Wysocki, Sudeep Holla, Viresh Kumar, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Saravana Kannan, Quentin Perret,
Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
Marc Zyngier, Oliver Upton, Dietmar Eggemann, Pavan Kondeti,
Gupta Pankaj, Mel Gorman, kernel-team, linux-pm, devicetree,
linux-kernel

On Mon, May 20, 2024 at 09:30:52PM -0700, David Dai wrote:
> Introduce a virtualized cpufreq driver for guest kernels to improve
> performance and power of workloads within VMs.
>
> This driver does two main things:
>
> 1. Sends the frequency of vCPUs as a hint to the host. The host uses the
> hint to schedule the vCPU threads and decide physical CPU frequency.
>
> 2. If a VM does not support a virtualized FIE(like AMUs), it queries the
> host CPU frequency by reading a MMIO region of a virtual cpufreq device
> to update the guest's frequency scaling factor periodically. This enables
> accurate Per-Entity Load Tracking for tasks running in the guest.
>
> +
> +/*
> + * CPU0..CPUn
> + * +-------------+-------------------------------+--------+-------+
> + * | Register    | Description                   | Offset | Len   |
> + * +-------------+-------------------------------+--------+-------+
> + * | cur_perf    | read this register to get     | 0x0    | 0x4   |
> + * |             | the current perf (integer val |        |       |
> + * |             | representing perf relative to |        |       |
> + * |             | max performance)              |        |       |
> + * |             | that vCPU is running at       |        |       |
> + * +-------------+-------------------------------+--------+-------+
> + * | set_perf    | write to this register to set | 0x4    | 0x4   |
> + * |             | perf value of the vCPU        |        |       |
> + * +-------------+-------------------------------+--------+-------+
> + * | perftbl_len | number of entries in perf     | 0x8    | 0x4   |
> + * |             | table. A single entry in the  |        |       |
> + * |             | perf table denotes no table   |        |       |
> + * |             | and the entry contains        |        |       |
> + * |             | the maximum perf value        |        |       |
> + * |             | that this vCPU supports.      |        |       |
> + * |             | The guest can request any     |        |       |
> + * |             | value between 1 and max perf  |        |       |
> + * |             | when perftbls are not used.   |        |       |
> + * +---------------------------------------------+--------+-------+
> + * | perftbl_sel | write to this register to     | 0xc    | 0x4   |
> + * |             | select perf table entry to    |        |       |
> + * |             | read from                     |        |       |
> + * +---------------------------------------------+--------+-------+
> + * | perftbl_rd  | read this register to get     | 0x10   | 0x4   |
> + * |             | perf value of the selected    |        |       |
> + * |             | entry based on perftbl_sel    |        |       |
> + * +---------------------------------------------+--------+-------+
> + * | perf_domain | performance domain number     | 0x14   | 0x4   |
> + * |             | that this vCPU belongs to.    |        |       |
> + * |             | vCPUs sharing the same perf   |        |       |
> + * |             | domain number are part of the |        |       |
> + * |             | same performance domain.      |        |       |
> + * +-------------+-------------------------------+--------+-------+
> + */

I think it is a good idea to version this table, so that it gives flexibility
to update the entries. It is a must if we are getting away from DT. I didn't
give complete information in my previous response where I agreed with Rafael.

I am not sure how feasible it is, but can it be queried via KVM IOCTLs
to the VMM? Just a thought, I am exploring how to make this work even on ACPI
systems. It is simpler if we need not rely on DT or ACPI.

--
Regards,
Sudeep
* Re: [PATCH v6 2/2] cpufreq: add virtual-cpufreq driver
2024-06-28 12:51 ` Sudeep Holla
@ 2024-07-10 9:05 ` Marc Zyngier
0 siblings, 0 replies; 13+ messages in thread
From: Marc Zyngier @ 2024-07-10 9:05 UTC (permalink / raw)
To: Sudeep Holla
Cc: David Dai, Rafael J. Wysocki, Viresh Kumar, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Saravana Kannan, Quentin Perret,
Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
Oliver Upton, Dietmar Eggemann, Pavan Kondeti, Gupta Pankaj,
Mel Gorman, kernel-team, linux-pm, devicetree, linux-kernel

On Fri, 28 Jun 2024 13:51:06 +0100, Sudeep Holla <sudeep.holla@arm.com> wrote:
>
> On Mon, May 20, 2024 at 09:30:52PM -0700, David Dai wrote:
> > Introduce a virtualized cpufreq driver for guest kernels to improve
> > performance and power of workloads within VMs.
> >
> > This driver does two main things:
> >
> > 1. Sends the frequency of vCPUs as a hint to the host. The host uses the
> > hint to schedule the vCPU threads and decide physical CPU frequency.
> >
> > 2. If a VM does not support a virtualized FIE(like AMUs), it queries the
> > host CPU frequency by reading a MMIO region of a virtual cpufreq device
> > to update the guest's frequency scaling factor periodically. This enables
> > accurate Per-Entity Load Tracking for tasks running in the guest.
> >
> > +
> > +/*
> > + * CPU0..CPUn
> > + * +-------------+-------------------------------+--------+-------+
> > + * | Register    | Description                   | Offset | Len   |
> > + * +-------------+-------------------------------+--------+-------+
> > + * | cur_perf    | read this register to get     | 0x0    | 0x4   |
> > + * |             | the current perf (integer val |        |       |
> > + * |             | representing perf relative to |        |       |
> > + * |             | max performance)              |        |       |
> > + * |             | that vCPU is running at       |        |       |
> > + * +-------------+-------------------------------+--------+-------+
> > + * | set_perf    | write to this register to set | 0x4    | 0x4   |
> > + * |             | perf value of the vCPU        |        |       |
> > + * +-------------+-------------------------------+--------+-------+
> > + * | perftbl_len | number of entries in perf     | 0x8    | 0x4   |
> > + * |             | table. A single entry in the  |        |       |
> > + * |             | perf table denotes no table   |        |       |
> > + * |             | and the entry contains        |        |       |
> > + * |             | the maximum perf value        |        |       |
> > + * |             | that this vCPU supports.      |        |       |
> > + * |             | The guest can request any     |        |       |
> > + * |             | value between 1 and max perf  |        |       |
> > + * |             | when perftbls are not used.   |        |       |
> > + * +---------------------------------------------+--------+-------+
> > + * | perftbl_sel | write to this register to     | 0xc    | 0x4   |
> > + * |             | select perf table entry to    |        |       |
> > + * |             | read from                     |        |       |
> > + * +---------------------------------------------+--------+-------+
> > + * | perftbl_rd  | read this register to get     | 0x10   | 0x4   |
> > + * |             | perf value of the selected    |        |       |
> > + * |             | entry based on perftbl_sel    |        |       |
> > + * +---------------------------------------------+--------+-------+
> > + * | perf_domain | performance domain number     | 0x14   | 0x4   |
> > + * |             | that this vCPU belongs to.    |        |       |
> > + * |             | vCPUs sharing the same perf   |        |       |
> > + * |             | domain number are part of the |        |       |
> > + * |             | same performance domain.      |        |       |
> > + * +-------------+-------------------------------+--------+-------+
> > + */
>
> I think it is a good idea to version this table, so that it gives flexibility
> to update the entries. It is a must if we are getting away from DT. I didn't
> give complete information in my previous response where I agreed with Rafael.
>
> I am not sure how feasible it is, but can it be queried via KVM IOCTLs
> to the VMM? Just a thought, I am exploring how to make this work even on ACPI
> systems. It is simpler if we need not rely on DT or ACPI.

KVM should not have to know any of this. This is purely a contract
(and a pretty weak one) between userspace and the guest.

M.

--
Without deviation from the norm, progress is not possible.
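For readers following the register table quoted above, here is a minimal user-space sketch of how a guest might walk the perf table through the perftbl_sel/perftbl_rd pair. The offsets come from the table in the patch; everything else (the REG_* names, struct fake_vcpu_dev, fake_readl(), fake_writel() and read_perf_table()) is hypothetical and only stands in for the VMM-emulated MMIO window so the access pattern can be shown outside the kernel. It is not the driver's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register offsets as listed in the table from the patch (names are ours). */
enum {
	REG_CUR_PERF_OFFSET    = 0x00,
	REG_SET_PERF_OFFSET    = 0x04,
	REG_PERFTBL_LEN_OFFSET = 0x08,
	REG_PERFTBL_SEL_OFFSET = 0x0c,
	REG_PERFTBL_RD_OFFSET  = 0x10,
	REG_PERF_DOMAIN_OFFSET = 0x14,
};

/* Hypothetical stand-in for the VMM side of one vCPU's register window. */
struct fake_vcpu_dev {
	const uint32_t *perftbl;  /* perf values backing perftbl_rd */
	uint32_t perftbl_len;     /* value returned by perftbl_len */
	uint32_t sel;             /* last value written to perftbl_sel */
	uint32_t cur_perf;        /* value returned by cur_perf */
	uint32_t set_perf;        /* last value written to set_perf */
	uint32_t domain;          /* value returned by perf_domain */
};

/* Emulated 32-bit register read; unknown offsets read as 0. */
static uint32_t fake_readl(const struct fake_vcpu_dev *dev, uint32_t off)
{
	switch (off) {
	case REG_CUR_PERF_OFFSET:
		return dev->cur_perf;
	case REG_PERFTBL_LEN_OFFSET:
		return dev->perftbl_len;
	case REG_PERFTBL_RD_OFFSET:
		return dev->sel < dev->perftbl_len ? dev->perftbl[dev->sel] : 0;
	case REG_PERF_DOMAIN_OFFSET:
		return dev->domain;
	default:
		return 0;
	}
}

/* Emulated 32-bit register write; writes to read-only offsets are ignored. */
static void fake_writel(struct fake_vcpu_dev *dev, uint32_t off, uint32_t val)
{
	switch (off) {
	case REG_SET_PERF_OFFSET:
		dev->set_perf = val;
		break;
	case REG_PERFTBL_SEL_OFFSET:
		dev->sel = val;
		break;
	default:
		break;
	}
}

/*
 * Select each table entry in turn and read it back. Copies at most cap
 * entries into out and returns the entry count the device reports.
 */
static uint32_t read_perf_table(struct fake_vcpu_dev *dev, uint32_t *out,
				uint32_t cap)
{
	uint32_t len = fake_readl(dev, REG_PERFTBL_LEN_OFFSET);

	assert(out != NULL);
	for (uint32_t i = 0; i < len && i < cap; i++) {
		fake_writel(dev, REG_PERFTBL_SEL_OFFSET, i);
		out[i] = fake_readl(dev, REG_PERFTBL_RD_OFFSET);
	}
	return len;
}
```

Per the table, a reported length of 1 means no table is in use and the single entry is the maximum perf value, so a caller of read_perf_table() would treat out[0] as max perf in that case.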
* Re: [PATCH v6 0/2] Improve VM CPUfreq and task placement behavior
2024-05-21 4:30 [PATCH v6 0/2] Improve VM CPUfreq and task placement behavior David Dai
2024-05-21 4:30 ` [PATCH v6 1/2] dt-bindings: cpufreq: add virtual cpufreq device David Dai
2024-05-21 4:30 ` [PATCH v6 2/2] cpufreq: add virtual-cpufreq driver David Dai
@ 2024-05-21 7:41 ` Viresh Kumar
2024-05-22 18:22 ` David Dai
2 siblings, 1 reply; 13+ messages in thread
From: Viresh Kumar @ 2024-05-21 7:41 UTC (permalink / raw)
To: David Dai
Cc: Rafael J. Wysocki, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Sudeep Holla, Saravana Kannan, Quentin Perret, Masami Hiramatsu,
Will Deacon, Peter Zijlstra, Vincent Guittot, Marc Zyngier,
Oliver Upton, Dietmar Eggemann, Pavan Kondeti, Gupta Pankaj,
Mel Gorman, kernel-team, linux-pm, devicetree, linux-kernel

On 20-05-24, 21:30, David Dai wrote:
> v5 -> v6:
> -Updated driver to use .target instead of .target_index

May have missed the discussion, but why is this done ?

--
viresh
* Re: [PATCH v6 0/2] Improve VM CPUfreq and task placement behavior
2024-05-21 7:41 ` [PATCH v6 0/2] Improve VM CPUfreq and task placement behavior Viresh Kumar
@ 2024-05-22 18:22 ` David Dai
0 siblings, 0 replies; 13+ messages in thread
From: David Dai @ 2024-05-22 18:22 UTC (permalink / raw)
To: Viresh Kumar
Cc: Rafael J. Wysocki, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Sudeep Holla, Saravana Kannan, Quentin Perret, Masami Hiramatsu,
Will Deacon, Peter Zijlstra, Vincent Guittot, Marc Zyngier,
Oliver Upton, Dietmar Eggemann, Pavan Kondeti, Gupta Pankaj,
Mel Gorman, kernel-team, linux-pm, devicetree, linux-kernel

Hi Viresh,

On Tue, May 21, 2024 at 12:41 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 20-05-24, 21:30, David Dai wrote:
> > v5 -> v6:
> > -Updated driver to use .target instead of .target_index
>
> May have missed the discussion, but why is this done ?

Since the driver now queries the device for frequency info, the interface
allows the VMM (Virtual Machine Manager) to optionally use tables,
depending on the use case. .target is used in the driver to support both
configurations: one where the table is used, and one where only max_perf
is used (where the guest can vote for any value in [1, max_perf]).

Thanks,
David

>
> --
> viresh
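To make the two configurations in the reply above concrete, here is a small sketch of how a request could be resolved in each case. It is an illustration under stated assumptions, not the driver's .target implementation: resolve_perf_request() is a hypothetical helper, the table is assumed to be sorted in ascending order, and the "lowest entry at or above the request" rule for the table case is our choice of policy (the thread does not specify one).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Resolve a requested perf value against what the device advertises.
 *
 * len == 1: no table is in use; tbl[0] is max_perf and any value in
 *           [1, max_perf] may be requested, so the request is clamped.
 * len  > 1: a table is in use; pick the lowest entry that is at least the
 *           request, falling back to the highest entry (assumes an
 *           ascending table; this selection rule is an assumption).
 */
static uint32_t resolve_perf_request(const uint32_t *tbl, size_t len,
				     uint32_t target)
{
	assert(tbl != NULL && len >= 1);

	if (len == 1) {
		uint32_t max_perf = tbl[0];

		if (target < 1)
			return 1;
		return target > max_perf ? max_perf : target;
	}

	for (size_t i = 0; i < len; i++) {
		if (tbl[i] >= target)
			return tbl[i];
	}
	return tbl[len - 1];
}
```

With only max_perf advertised, the guest's vote passes through unchanged as long as it lies in [1, max_perf]; with a table, the vote is snapped to one of the advertised entries. A callback that receives the raw target value can cover both cases, which matches the reason given for moving away from .target_index.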
end of thread, other threads:[~2024-08-30 22:52 UTC | newest]
Thread overview: 13+ messages
2024-05-21 4:30 [PATCH v6 0/2] Improve VM CPUfreq and task placement behavior David Dai
2024-05-21 4:30 ` [PATCH v6 1/2] dt-bindings: cpufreq: add virtual cpufreq device David Dai
2024-05-22 15:00 ` Rob Herring (Arm)
2024-05-21 4:30 ` [PATCH v6 2/2] cpufreq: add virtual-cpufreq driver David Dai
2024-06-27 21:22 ` David Dai
2024-06-28 12:01 ` Rafael J. Wysocki
2024-06-28 12:42 ` Sudeep Holla
2024-07-10 0:07 ` Saravana Kannan
2024-08-30 22:51 ` David Dai
2024-06-28 12:51 ` Sudeep Holla
2024-07-10 9:05 ` Marc Zyngier
2024-05-21 7:41 ` [PATCH v6 0/2] Improve VM CPUfreq and task placement behavior Viresh Kumar
2024-05-22 18:22 ` David Dai