From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 22C98D1AD53 for ; Wed, 16 Oct 2024 13:05:32 +0000 (UTC) Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1t13hx-0004E4-QP; Wed, 16 Oct 2024 09:04:58 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1t13hu-0004Di-MW for qemu-devel@nongnu.org; Wed, 16 Oct 2024 09:04:54 -0400 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1t13hq-0000AA-5c for qemu-devel@nongnu.org; Wed, 16 Oct 2024 09:04:54 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1729083887; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=YtS7wai1gtWlH9zNK5sIMJ83UyDg+78MLhmpA/pnEV4=; b=TamEW9dpJlRnq4YhS5Zkl8PHdzQSv5+RDEzgzP71bZsqWex4cjfXxIQZAZ/Er81uEkhd4i 2taMBn9sdd2NDXLrXrTc+yVyXnqLUaEvLhB+zgwAhgCEDq6scHhk8oi+YNzHAl+0AsZ2zu i3GDQMmYTBnPs82TLaxKr805hFGaxQM= Received: from mail-lj1-f200.google.com (mail-lj1-f200.google.com [209.85.208.200]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-546-qP2RzaOIPwaTQu77K7JatQ-1; Wed, 16 Oct 2024 09:04:43 -0400 X-MC-Unique: qP2RzaOIPwaTQu77K7JatQ-1 Received: by mail-lj1-f200.google.com with SMTP id 38308e7fff4ca-2fb58c0df21so17808311fa.2 for ; Wed, 16 Oct 2024 06:04:43 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1729083882; x=1729688682; h=in-reply-to:references:user-agent:cc:to:from:subject:message-id :date:content-transfer-encoding:mime-version:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=YtS7wai1gtWlH9zNK5sIMJ83UyDg+78MLhmpA/pnEV4=; b=CNm7LBBl/kJI2vdM/lC5bXb76wae74TZAzm8X7ZFQT6uu4o0UY51AR7OF6O3VVgqRR fQfB6S9QfhovbN21IXTtMBoE0brnHEzUlO2GsOgckkvh+nT+pqnaxqR+Z7dPaf94r/eB xvegjqurNL+bQdOX/7pzQL1b2USbiVy812yQgsGqFVT7mKtuFZC2jmtr0bbNjsSFtWB8 IBLTEhnw6TYpShaxmjB0grYbbxWgqjBQgBMrcU4RZmikalN53Rdao4ou49STKbf0qJ0J 4WN4aCvzUxNKEIroS5Qg9EEaqfnDvgkBCsrTdySmbntWJxMt78QW5Fhf7EeEfqLBBD0n o2BA== X-Forwarded-Encrypted: i=1; AJvYcCXZvPU3m6K5zYtPFG16sgghPV1XT9g6b+cR8HVViM/58iQJvCJRueMmORr/xDCCJMlwb7OM06YHvUk4@nongnu.org X-Gm-Message-State: AOJu0YwuyVCnj5oAeNnq+JAu1UhkVvs8HeLcTWbRCxRsfc+xtXnxcUDN lTGUOLhlw8BBGnkIwpB1RAA/ijYGzDi/YyGsgcyumad65EhJukF+RkscOsKnqdc2KDQ5LUV/Ck3 Y6xF6Cgjd2sa5iyldm1rs51MXFOfHVckIv15WoFc3CmjqJnWYmZr6 X-Received: by 2002:a2e:be04:0:b0:2fa:bf53:1dad with SMTP id 38308e7fff4ca-2fb61ba26a7mr30606151fa.31.1729083881003; Wed, 16 Oct 2024 06:04:41 -0700 (PDT) X-Google-Smtp-Source: AGHT+IFFShum/KY58Ogqn1ofdjlP7jSkLk/1REJcRWUWNmQoymKTMaoDG75/D0cz76z6Y6sCzR4VfA== X-Received: by 2002:a2e:be04:0:b0:2fa:bf53:1dad with SMTP id 38308e7fff4ca-2fb61ba26a7mr30605581fa.31.1729083880122; Wed, 16 Oct 2024 06:04:40 -0700 (PDT) Received: from localhost ([2a01:e0a:a9a:c460:2827:8723:3c60:c84a]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-4313f569e9esm48470805e9.18.2024.10.16.06.04.39 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 16 Oct 2024 06:04:39 -0700 (PDT) Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=UTF-8 Date: Wed, 16 Oct 2024 15:04:39 +0200 Message-Id: Subject: Re: [PATCH v6 3/3] Add support for RAPL MSRs in KVM/Qemu From: "Anthony Harivel" To: "Igor Mammedov" Cc: , , , , , User-Agent: aerc/0.18.2-69-g1c54bb3a9d11 References: <20240522153453.1230389-1-aharivel@redhat.com> <20240522153453.1230389-4-aharivel@redhat.com> <20241016141751.01443340@imammedo.users.ipa.redhat.com> In-Reply-To: <20241016141751.01443340@imammedo.users.ipa.redhat.com> Received-SPF: pass client-ip=170.10.133.124; envelope-from=aharivel@redhat.com; helo=us-smtp-delivery-124.mimecast.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.038, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=0.001, RCVD_IN_MSPIKE_WL=0.001, RCVD_IN_VALIDITY_RPBL_BLOCKED=0.001, RCVD_IN_VALIDITY_SAFE_BLOCKED=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Hi Igor, I will let Paolo or Daniel answering those architecture questions, they=20 are more qualified than me.=20 Thanks Anthony Igor Mammedov, Oct 16, 2024 at 14:17: > On Wed, 22 May 2024 17:34:52 +0200 > Anthony Harivel wrote: > >> Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL >> interface (Running Average Power Limit) for advertising the accumulated >> energy consumption of various power domains (e.g. CPU packages, DRAM, >> etc.). >>=20 >> The consumption is reported via MSRs (model specific registers) like >> MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are >> 64 bits registers that represent the accumulated energy consumption in >> micro Joules. They are updated by microcode every ~1ms. >>=20 >> For now, KVM always returns 0 when the guest requests the value of >> these MSRs. Use the KVM MSR filtering mechanism to allow QEMU handle >> these MSRs dynamically in userspace. >>=20 >> To limit the amount of system calls for every MSR call, create a new >> thread in QEMU that updates the "virtual" MSR values asynchronously. >>=20 >> Each vCPU has its own vMSR to reflect the independence of vCPUs. The >> thread updates the vMSR values with the ratio of energy consumed of >> the whole physical CPU package the vCPU thread runs on and the >> thread's utime and stime values. >>=20 >> All other non-vCPU threads are also taken into account. Their energy >> consumption is evenly distributed among all vCPUs threads running on >> the same physical CPU package. >>=20 >> To overcome the problem that reading the RAPL MSR requires priviliged >> access, a socket communication between QEMU and the qemu-vmsr-helper is >> mandatory. You can specified the socket path in the parameter. >>=20 >> This feature is activated with -accel kvm,rapl=3Dtrue,path=3D/path/sock.= sock >>=20 >> Actual limitation: >> - Works only on Intel host CPU because AMD CPUs are using different MSR >> adresses. >>=20 >> - Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at >> the moment. >>=20 >> Signed-off-by: Anthony Harivel >> --- >> accel/kvm/kvm-all.c | 27 +++ >> docs/specs/index.rst | 1 + >> docs/specs/rapl-msr.rst | 155 ++++++++++++ >> include/sysemu/kvm_int.h | 32 +++ >> target/i386/cpu.h | 8 + >> target/i386/kvm/kvm.c | 431 +++++++++++++++++++++++++++++++++- >> target/i386/kvm/meson.build | 1 + >> target/i386/kvm/vmsr_energy.c | 344 +++++++++++++++++++++++++++ >> target/i386/kvm/vmsr_energy.h | 99 ++++++++ >> 9 files changed, 1097 insertions(+), 1 deletion(-) >> create mode 100644 docs/specs/rapl-msr.rst >> create mode 100644 target/i386/kvm/vmsr_energy.c >> create mode 100644 target/i386/kvm/vmsr_energy.h >>=20 > > >> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c >> index c0be9f5eedb8..f455e6b987b4 100644 >> --- a/accel/kvm/kvm-all.c >> +++ b/accel/kvm/kvm-all.c >> @@ -3745,6 +3745,21 @@ static void kvm_set_device(Object *obj, >> s->device =3D g_strdup(value); >> } >> =20 >> +static void kvm_set_kvm_rapl(Object *obj, bool value, Error **errp) >> +{ >> + KVMState *s =3D KVM_STATE(obj); >> + s->msr_energy.enable =3D value; >> +} >> + >> +static void kvm_set_kvm_rapl_socket_path(Object *obj, >> + const char *str, >> + Error **errp) >> +{ >> + KVMState *s =3D KVM_STATE(obj); >> + g_free(s->msr_energy.socket_path); >> + s->msr_energy.socket_path =3D g_strdup(str); >> +} >> + >> static void kvm_accel_instance_init(Object *obj) >> { >> KVMState *s =3D KVM_STATE(obj); >> @@ -3764,6 +3779,7 @@ static void kvm_accel_instance_init(Object *obj) >> s->xen_gnttab_max_frames =3D 64; >> s->xen_evtchn_max_pirq =3D 256; >> s->device =3D NULL; >> + s->msr_energy.enable =3D false; >> } >> =20 >> /** >> @@ -3808,6 +3824,17 @@ static void kvm_accel_class_init(ObjectClass *oc,= void *data) >> object_class_property_set_description(oc, "device", >> "Path to the device node to use (default: /dev/kvm)"); >> =20 >> + object_class_property_add_bool(oc, "rapl", >> + NULL, >> + kvm_set_kvm_rapl); >> + object_class_property_set_description(oc, "rapl", >> + "Allow energy related MSRs for RAPL interface in Guest"); >> + >> + object_class_property_add_str(oc, "rapl-helper-socket", NULL, >> + kvm_set_kvm_rapl_socket_path); >> + object_class_property_set_description(oc, "rapl-helper-socket", >> + "Socket Path for comminucating with the Virtual MSR helper daem= on"); >> + >> kvm_arch_accel_class_init(oc); >> } > > it seems, RAPL is x86 specific feature, so why it is in generic KVM code = instead of > target/i386/kvm/kvm.c: kvm_arch_accel_class_init() > >> =20 >> diff --git a/docs/specs/index.rst b/docs/specs/index.rst >> index 1484e3e76077..e738ea7d102f 100644 >> --- a/docs/specs/index.rst >> +++ b/docs/specs/index.rst >> @@ -33,3 +33,4 @@ guest hardware that is specific to QEMU. >> virt-ctlr >> vmcoreinfo >> vmgenid >> + rapl-msr >> diff --git a/docs/specs/rapl-msr.rst b/docs/specs/rapl-msr.rst >> new file mode 100644 >> index 000000000000..1202ee89bee0 >> --- /dev/null >> +++ b/docs/specs/rapl-msr.rst >> @@ -0,0 +1,155 @@ >> +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >> +RAPL MSR support >> +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >> + >> +The RAPL interface (Running Average Power Limit) is advertising the acc= umulated >> +energy consumption of various power domains (e.g. CPU packages, DRAM, e= tc.). >> + >> +The consumption is reported via MSRs (model specific registers) like >> +MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are = 64 bits >> +registers that represent the accumulated energy consumption in micro Jo= ules. >> + >> +Thanks to the MSR Filtering patch [#a]_ not all MSRs are handled by KVM= . Some >> +of them can now be handled by the userspace (QEMU). It uses a mechanism= called >> +"MSR filtering" where a list of MSRs is given at init time of a VM to K= VM so >> +that a callback is put in place. The design of this patch uses only thi= s >> +mechanism for handling the MSRs between guest/host. >> + >> +At the moment the following MSRs are involved: >> + >> +.. code:: C >> + >> + #define MSR_RAPL_POWER_UNIT 0x00000606 >> + #define MSR_PKG_POWER_LIMIT 0x00000610 >> + #define MSR_PKG_ENERGY_STATUS 0x00000611 >> + #define MSR_PKG_POWER_INFO 0x00000614 >> + >> +The ``*_POWER_UNIT``, ``*_POWER_LIMIT``, ``*_POWER INFO`` are part of t= he RAPL >> +spec and specify the power limit of the package, provide range of param= eter(min >> +power, max power,..) and also the information of the multiplier for the= energy >> +counter to calculate the power. Those MSRs are populated once at the be= ginning >> +by reading the host CPU MSRs and are given back to the guest 1:1 when >> +requested. >> + >> +The MSR_PKG_ENERGY_STATUS is a counter; it represents the total amount = of >> +energy consumed since the last time the register was cleared. If you mu= ltiply >> +it with the UNIT provided above you'll get the power in micro-joules. T= his >> +counter is always increasing and it increases more or less faster depen= ding on >> +the consumption of the package. This counter is supposed to overflow at= some >> +point. >> + >> +Each core belonging to the same Package reading the MSR_PKG_ENERGY_STAT= US (i.e >> +"rdmsr 0x611") will retrieve the same value. The value represents the e= nergy >> +for the whole package. Whatever Core reading it will get the same value= and a >> +core that belongs to PKG-0 will not be able to get the value of PKG-1 a= nd >> +vice-versa. >> + >> +High level implementation >> +------------------------- >> + >> +In order to update the value of the virtual MSR, a QEMU thread is creat= ed. >> +The thread is basically just an infinity loop that does: >> + >> +1. Snapshot of the time metrics of all QEMU threads (Time spent schedul= ed in >> + Userspace and System) >> + >> +2. Snapshot of the actual MSR_PKG_ENERGY_STATUS counter of all packages= where >> + the QEMU threads are running on. >> + >> +3. Sleep for 1 second - During this pause the vcpu and other non-vcpu t= hreads >> + will do what they have to do and so the energy counter will increase= . >> + >> +4. Repeat 2. and 3. and calculate the delta of every metrics representi= ng the >> + time spent scheduled for each QEMU thread *and* the energy spent by = the >> + packages during the pause. >> + >> +5. Filter the vcpu threads and the non-vcpu threads. >> + >> +6. Retrieve the topology of the Virtual Machine. This helps identify wh= ich >> + vCPU is running on which virtual package. >> + >> +7. The total energy spent by the non-vcpu threads is divided by the num= ber >> + of vcpu threads so that each vcpu thread will get an equal part of t= he >> + energy spent by the QEMU workers. >> + >> +8. Calculate the ratio of energy spent per vcpu threads. >> + >> +9. Calculate the energy for each virtual package. >> + >> +10. The virtual MSRs are updated for each virtual package. Each vCPU th= at >> + belongs to the same package will return the same value when accessi= ng the >> + the MSR. >> + >> +11. Loop back to 1. >> + >> +Ratio calculation >> +----------------- >> + >> +In Linux, a process has an execution time associated with it. The sched= uler is >> +dividing the time in clock ticks. The number of clock ticks per second = can be >> +found by the sysconf system call. A typical value of clock ticks per se= cond is >> +100. So a core can run a process at the maximum of 100 ticks per second= . If a >> +package has 4 cores, 400 ticks maximum can be scheduled on all the core= s >> +of the package for a period of 1 second. >> + >> +The /proc/[pid]/stat [#b]_ is a sysfs file that can give the executed t= ime of a >> +process with the [pid] as the process ID. It gives the amount of ticks = the >> +process has been scheduled in userspace (utime) and kernel space (stime= ). >> + >> +By reading those metrics for a thread, one can calculate the ratio of t= ime the >> +package has spent executing the thread. >> + >> +Example: >> + >> +A 4 cores package can schedule a maximum of 400 ticks per second with 1= 00 ticks >> +per second per core. If a thread was scheduled for 100 ticks between a = second >> +on this package, that means my thread has been scheduled for 1/4 of the= whole >> +package. With that, the calculation of the energy spent by the thread o= n this >> +package during this whole second is 1/4 of the total energy spent by th= e >> +package. >> + >> +Usage >> +----- >> + >> +Currently this feature is only working on an Intel CPU that has the RAP= L driver >> +mounted and available in the sysfs. if not, QEMU fails at start-up. >> + >> +This feature is activated with -accel >> +kvm,rapl=3Dtrue,rapl-helper-socket=3D/path/sock.sock >> + >> +It is important that the socket path is the same as the one >> +:program:`qemu-vmsr-helper` is listening to. >> + >> +qemu-vmsr-helper >> +---------------- >> + >> +The qemu-vmsr-helper is working very much like the qemu-pr-helper. Inst= ead of >> +making persistent reservation, qemu-vmsr-helper is here to overcome the >> +CVE-2020-8694 which remove user access to the rapl msr attributes. >> + >> +A socket communication is established between QEMU processes that has t= he RAPL >> +MSR support activated and the qemu-vmsr-helper. A systemd service and s= ocket >> +activation is provided in contrib/systemd/qemu-vmsr-helper.(service/soc= ket). >> + >> +The systemd socket uses 600, like contrib/systemd/qemu-pr-helper.socket= . The >> +socket can be passed via SCM_RIGHTS by libvirt, or its permissions can = be >> +changed (e.g. 660 and root:kvm for a Debian system for example). Libvir= t could >> +also start a separate helper if needed. All in all, the policy is left = to the >> +user. >> + >> +See the qemu-pr-helper documentation or manpage for further details. >> + >> +Current Limitations >> +------------------- >> + >> +- Works only on Intel host CPUs because AMD CPUs are using different MS= R >> + addresses. >> + >> +- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at t= he >> + moment. >> + >> +References >> +---------- >> + >> +.. [#a] https://patchwork.kernel.org/project/kvm/patch/20200916202951.2= 3760-7-graf@amazon.com/ >> +.. [#b] https://man7.org/linux/man-pages/man5/proc.5.html >> diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h >> index 3f3d13f81669..1d8fb1473bdf 100644 >> --- a/include/sysemu/kvm_int.h >> +++ b/include/sysemu/kvm_int.h >> @@ -14,6 +14,9 @@ >> #include "qemu/accel.h" >> #include "qemu/queue.h" >> #include "sysemu/kvm.h" >> +#include "hw/boards.h" >> +#include "hw/i386/topology.h" >> +#include "io/channel-socket.h" > > > I'm skeptical about pulling in x86 specific headers into generic kvm head= er, > (it's miracle that it builds at all), and by extension 'board.h' as well > > can it be refactored in a way that you won't need to pull in 'board.h' > and avoid using x86 specific code in generic KVM code? > (which also applies to added below KVMHostTopoInfo) > > >> typedef struct KVMSlot >> { >> @@ -50,6 +53,34 @@ typedef struct KVMMemoryListener { >> =20 >> #define KVM_MSI_HASHTAB_SIZE 256 >> =20 >> +typedef struct KVMHostTopoInfo { >> + /* Number of package on the Host */ >> + unsigned int maxpkgs; >> + /* Number of cpus on the Host */ >> + unsigned int maxcpus; >> + /* Number of cpus on each different package */ >> + unsigned int *pkg_cpu_count; >> + /* Each package can have different maxticks */ >> + unsigned int *maxticks; >> +} KVMHostTopoInfo; >> + >> +struct KVMMsrEnergy { >> + pid_t pid; >> + bool enable; >> + char *socket_path; >> + QIOChannelSocket *sioc; >> + QemuThread msr_thr; >> + unsigned int guest_vcpus; >> + unsigned int guest_vsockets; >> + X86CPUTopoInfo guest_topo_info; >> + KVMHostTopoInfo host_topo; >> + const CPUArchIdList *guest_cpu_list; >> + uint64_t *msr_value; >> + uint64_t msr_unit; >> + uint64_t msr_limit; >> + uint64_t msr_info; >> +}; >> + >> enum KVMDirtyRingReaperState { >> KVM_DIRTY_RING_REAPER_NONE =3D 0, >> /* The reaper is sleeping */ >> @@ -117,6 +148,7 @@ struct KVMState >> bool kvm_dirty_ring_with_bitmap; >> uint64_t kvm_eager_split_size; /* Eager Page Splitting chunk size = */ >> struct KVMDirtyRingReaper reaper; >> + struct KVMMsrEnergy msr_energy; >> NotifyVmexitOption notify_vmexit; >> uint32_t notify_window; >> uint32_t xen_version; >> diff --git a/target/i386/cpu.h b/target/i386/cpu.h >> index ccccb62fc353..c3891c1a6b4e 100644 >> --- a/target/i386/cpu.h >> +++ b/target/i386/cpu.h >> @@ -397,6 +397,10 @@ typedef enum X86Seg { >> #define MSR_IA32_TSX_CTRL 0x122 >> #define MSR_IA32_TSCDEADLINE 0x6e0 >> #define MSR_IA32_PKRS 0x6e1 >> +#define MSR_RAPL_POWER_UNIT 0x00000606 >> +#define MSR_PKG_POWER_LIMIT 0x00000610 >> +#define MSR_PKG_ENERGY_STATUS 0x00000611 >> +#define MSR_PKG_POWER_INFO 0x00000614 >> #define MSR_ARCH_LBR_CTL 0x000014ce >> #define MSR_ARCH_LBR_DEPTH 0x000014cf >> #define MSR_ARCH_LBR_FROM_0 0x00001500 >> @@ -1790,6 +1794,10 @@ typedef struct CPUArchState { >> =20 >> uintptr_t retaddr; >> =20 >> + /* RAPL MSR */ >> + uint64_t msr_rapl_power_unit; >> + uint64_t msr_pkg_energy_status; >> + >> /* Fields up to this point are cleared by a CPU reset */ >> struct {} end_reset_fields; >> =20 >> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c >> index c5943605ee3a..8767c8e06028 100644 >> --- a/target/i386/kvm/kvm.c >> +++ b/target/i386/kvm/kvm.c > > I'd also suggest to move most of rapl related code added here to vmsr_ene= rgy.c, > and leave only msr access plumbing here. > (this file is already huge and adding 400+ loc isn't helping its readabil= ity at all) > >> @@ -16,9 +16,12 @@ >> #include "qapi/qapi-events-run-state.h" >> #include "qapi/error.h" >> #include "qapi/visitor.h" >> +#include >> #include >> #include >> #include >> +#include >> +#include >> =20 >> #include >> #include "standard-headers/asm-x86/kvm_para.h" >> @@ -26,6 +29,7 @@ >> =20 >> #include "cpu.h" >> #include "host-cpu.h" >> +#include "vmsr_energy.h" >> #include "sysemu/sysemu.h" >> #include "sysemu/hw_accel.h" >> #include "sysemu/kvm_int.h" >> @@ -2519,7 +2523,8 @@ static int kvm_get_supported_msrs(KVMState *s) >> return ret; >> } >> =20 >> -static bool kvm_rdmsr_core_thread_count(X86CPU *cpu, uint32_t msr, >> +static bool kvm_rdmsr_core_thread_count(X86CPU *cpu, >> + uint32_t msr, >> uint64_t *val) >> { >> CPUState *cs =3D CPU(cpu); >> @@ -2530,6 +2535,53 @@ static bool kvm_rdmsr_core_thread_count(X86CPU *c= pu, uint32_t msr, >> return true; >> } >> =20 >> +static bool kvm_rdmsr_rapl_power_unit(X86CPU *cpu, >> + uint32_t msr, >> + uint64_t *val) >> +{ >> + >> + CPUState *cs =3D CPU(cpu); >> + >> + *val =3D cs->kvm_state->msr_energy.msr_unit; >> + >> + return true; >> +} >> + >> +static bool kvm_rdmsr_pkg_power_limit(X86CPU *cpu, >> + uint32_t msr, >> + uint64_t *val) >> +{ >> + >> + CPUState *cs =3D CPU(cpu); >> + >> + *val =3D cs->kvm_state->msr_energy.msr_limit; >> + >> + return true; >> +} >> + >> +static bool kvm_rdmsr_pkg_power_info(X86CPU *cpu, >> + uint32_t msr, >> + uint64_t *val) >> +{ >> + >> + CPUState *cs =3D CPU(cpu); >> + >> + *val =3D cs->kvm_state->msr_energy.msr_info; >> + >> + return true; >> +} >> + >> +static bool kvm_rdmsr_pkg_energy_status(X86CPU *cpu, >> + uint32_t msr, >> + uint64_t *val) >> +{ >> + >> + CPUState *cs =3D CPU(cpu); >> + *val =3D cs->kvm_state->msr_energy.msr_value[cs->cpu_index]; >> + >> + return true; >> +} >> + >> static Notifier smram_machine_done; >> static KVMMemoryListener smram_listener; >> static AddressSpace smram_address_space; >> @@ -2564,6 +2616,340 @@ static void register_smram_listener(Notifier *n,= void *unused) >> &smram_address_space, 1, "kvm-smram"); >> } >> =20 >> +static void *kvm_msr_energy_thread(void *data) >> +{ >> + KVMState *s =3D data; >> + struct KVMMsrEnergy *vmsr =3D &s->msr_energy; >> + >> + g_autofree vmsr_package_energy_stat *pkg_stat =3D NULL; >> + g_autofree vmsr_thread_stat *thd_stat =3D NULL; >> + g_autofree CPUState *cpu =3D NULL; >> + g_autofree unsigned int *vpkgs_energy_stat =3D NULL; >> + unsigned int num_threads =3D 0; >> + >> + X86CPUTopoIDs topo_ids; >> + >> + rcu_register_thread(); >> + >> + /* Allocate memory for each package energy status */ >> + pkg_stat =3D g_new0(vmsr_package_energy_stat, vmsr->host_topo.maxpk= gs); >> + >> + /* Allocate memory for thread stats */ >> + thd_stat =3D g_new0(vmsr_thread_stat, 1); >> + >> + /* Allocate memory for holding virtual package energy counter */ >> + vpkgs_energy_stat =3D g_new0(unsigned int, vmsr->guest_vsockets); >> + >> + /* Populate the max tick of each packages */ >> + for (int i =3D 0; i < vmsr->host_topo.maxpkgs; i++) { >> + /* >> + * Max numbers of ticks per package >> + * Time in second * Number of ticks/second * Number of cores/pa= ckage >> + * ex: 100 ticks/second/CPU, 12 CPUs per Package gives 1200 tic= ks max >> + */ >> + vmsr->host_topo.maxticks[i] =3D (MSR_ENERGY_THREAD_SLEEP_US / 1= 000000) >> + * sysconf(_SC_CLK_TCK) >> + * vmsr->host_topo.pkg_cpu_count[i]; >> + } >> + >> + while (true) { >> + /* Get all qemu threads id */ >> + g_autofree pid_t *thread_ids =3D >> + thread_ids =3D vmsr_get_thread_ids(vmsr->pid, &num_threads)= ; >> + >> + if (thread_ids =3D=3D NULL) { >> + goto clean; >> + } >> + >> + thd_stat =3D g_renew(vmsr_thread_stat, thd_stat, num_threads); >> + /* Unlike g_new0, g_renew0 function doesn't exist yet... */ >> + memset(thd_stat, 0, num_threads * sizeof(vmsr_thread_stat)); >> + >> + /* Populate all the thread stats */ >> + for (int i =3D 0; i < num_threads; i++) { >> + thd_stat[i].utime =3D g_new0(unsigned long long, 2); >> + thd_stat[i].stime =3D g_new0(unsigned long long, 2); >> + thd_stat[i].thread_id =3D thread_ids[i]; >> + vmsr_read_thread_stat(vmsr->pid, >> + thd_stat[i].thread_id, >> + thd_stat[i].utime, >> + thd_stat[i].stime, >> + &thd_stat[i].cpu_id); >> + thd_stat[i].pkg_id =3D >> + vmsr_get_physical_package_id(thd_stat[i].cpu_id); >> + } >> + >> + /* Retrieve all packages power plane energy counter */ >> + for (int i =3D 0; i < vmsr->host_topo.maxpkgs; i++) { >> + for (int j =3D 0; j < num_threads; j++) { >> + /* >> + * Use the first thread we found that ran on the CPU >> + * of the package to read the packages energy counter >> + */ >> + if (thd_stat[j].pkg_id =3D=3D i) { >> + pkg_stat[i].e_start =3D >> + vmsr_read_msr(MSR_PKG_ENERGY_STATUS, >> + thd_stat[j].cpu_id, >> + thd_stat[j].thread_id, >> + s->msr_energy.sioc); >> + break; >> + } >> + } >> + } >> + >> + /* Sleep a short period while the other threads are working */ >> + usleep(MSR_ENERGY_THREAD_SLEEP_US); >> + >> + /* >> + * Retrieve all packages power plane energy counter >> + * Calculate the delta of all packages >> + */ >> + for (int i =3D 0; i < vmsr->host_topo.maxpkgs; i++) { >> + for (int j =3D 0; j < num_threads; j++) { >> + /* >> + * Use the first thread we found that ran on the CPU >> + * of the package to read the packages energy counter >> + */ >> + if (thd_stat[j].pkg_id =3D=3D i) { >> + pkg_stat[i].e_end =3D >> + vmsr_read_msr(MSR_PKG_ENERGY_STATUS, >> + thd_stat[j].cpu_id, >> + thd_stat[j].thread_id, >> + s->msr_energy.sioc); >> + /* >> + * Prevent the case we have migrate the VM >> + * during the sleep period or any other cases >> + * were energy counter might be lower after >> + * the sleep period. >> + */ >> + if (pkg_stat[i].e_end > pkg_stat[i].e_start) { >> + pkg_stat[i].e_delta =3D >> + pkg_stat[i].e_end - pkg_stat[i].e_start; >> + } else { >> + pkg_stat[i].e_delta =3D 0; >> + } >> + break; >> + } >> + } >> + } >> + >> + /* Delta of ticks spend by each thread between the sample */ >> + for (int i =3D 0; i < num_threads; i++) { >> + vmsr_read_thread_stat(vmsr->pid, >> + thd_stat[i].thread_id, >> + thd_stat[i].utime, >> + thd_stat[i].stime, >> + &thd_stat[i].cpu_id); >> + >> + if (vmsr->pid < 0) { >> + /* >> + * We don't count the dead thread >> + * i.e threads that existed before the sleep >> + * and not anymore >> + */ >> + thd_stat[i].delta_ticks =3D 0; >> + } else { >> + vmsr_delta_ticks(thd_stat, i); >> + } >> + } >> + >> + /* >> + * Identify the vcpu threads >> + * Calculate the number of vcpu per package >> + */ >> + CPU_FOREACH(cpu) { >> + for (int i =3D 0; i < num_threads; i++) { >> + if (cpu->thread_id =3D=3D thd_stat[i].thread_id) { >> + thd_stat[i].is_vcpu =3D true; >> + thd_stat[i].vcpu_id =3D cpu->cpu_index; >> + pkg_stat[thd_stat[i].pkg_id].nb_vcpu++; >> + thd_stat[i].acpi_id =3D kvm_arch_vcpu_id(cpu); >> + break; >> + } >> + } >> + } >> + >> + /* Retrieve the virtual package number of each vCPU */ >> + for (int i =3D 0; i < vmsr->guest_cpu_list->len; i++) { >> + for (int j =3D 0; j < num_threads; j++) { >> + if ((thd_stat[j].acpi_id =3D=3D >> + vmsr->guest_cpu_list->cpus[i].arch_id) >> + && (thd_stat[j].is_vcpu =3D=3D true)) { >> + x86_topo_ids_from_apicid(thd_stat[j].acpi_id, >> + &vmsr->guest_topo_info, &topo_ids); >> + thd_stat[j].vpkg_id =3D topo_ids.pkg_id; >> + } >> + } >> + } >> + >> + /* Calculate the total energy of all non-vCPU thread */ >> + for (int i =3D 0; i < num_threads; i++) { >> + if ((thd_stat[i].is_vcpu !=3D true) && >> + (thd_stat[i].delta_ticks > 0)) { >> + double temp; >> + temp =3D vmsr_get_ratio(pkg_stat[thd_stat[i].pkg_id].e_= delta, >> + thd_stat[i].delta_ticks, >> + vmsr->host_topo.maxticks[thd_stat[i].pkg_id]); >> + pkg_stat[thd_stat[i].pkg_id].e_ratio >> + +=3D (uint64_t)lround(temp); >> + } >> + } >> + >> + /* Calculate the ratio per non-vCPU thread of each package */ >> + for (int i =3D 0; i < vmsr->host_topo.maxpkgs; i++) { >> + if (pkg_stat[i].nb_vcpu > 0) { >> + pkg_stat[i].e_ratio =3D pkg_stat[i].e_ratio / pkg_stat[= i].nb_vcpu; >> + } >> + } >> + >> + /* >> + * Calculate the energy for each Package: >> + * Energy Package =3D sum of each vCPU energy that belongs to t= he package >> + */ >> + for (int i =3D 0; i < num_threads; i++) { >> + if ((thd_stat[i].is_vcpu =3D=3D true) && \ >> + (thd_stat[i].delta_ticks > 0)) { >> + double temp; >> + temp =3D vmsr_get_ratio(pkg_stat[thd_stat[i].pkg_id].e_= delta, >> + thd_stat[i].delta_ticks, >> + vmsr->host_topo.maxticks[thd_stat[i].pkg_id]); >> + vpkgs_energy_stat[thd_stat[i].vpkg_id] +=3D >> + (uint64_t)lround(temp); >> + vpkgs_energy_stat[thd_stat[i].vpkg_id] +=3D >> + pkg_stat[thd_stat[i].pkg_id].e_ratio; >> + } >> + } >> + >> + /* >> + * Finally populate the vmsr register of each vCPU with the tot= al >> + * package value to emulate the real hardware where each CPU re= turn the >> + * value of the package it belongs. >> + */ >> + for (int i =3D 0; i < num_threads; i++) { >> + if ((thd_stat[i].is_vcpu =3D=3D true) && \ >> + (thd_stat[i].delta_ticks > 0)) { >> + vmsr->msr_value[thd_stat[i].vcpu_id] =3D \ >> + vpkgs_energy_stat[thd_stat[i].v= pkg_id]; >> + } >> + } >> + >> + /* Freeing memory before zeroing the pointer */ >> + for (int i =3D 0; i < num_threads; i++) { >> + g_free(thd_stat[i].utime); >> + g_free(thd_stat[i].stime); >> + } >> + } >> + >> +clean: >> + rcu_unregister_thread(); >> + return NULL; >> +} >> + >> +static int kvm_msr_energy_thread_init(KVMState *s, MachineState *ms) >> +{ >> + MachineClass *mc =3D MACHINE_GET_CLASS(ms); >> + struct KVMMsrEnergy *r =3D &s->msr_energy; >> + int ret =3D 0; >> + >> + /* >> + * Sanity check >> + * 1. Host cpu must be Intel cpu >> + * 2. RAPL must be enabled on the Host >> + */ >> + if (is_host_cpu_intel()) { >> + error_report("The RAPL feature can only be enabled on hosts\ >> + with Intel CPU models"); >> + ret =3D 1; >> + goto out; >> + } >> + >> + if (!is_rapl_enabled()) { >> + ret =3D 1; >> + goto out; >> + } >> + >> + /* Retrieve the virtual topology */ >> + vmsr_init_topo_info(&r->guest_topo_info, ms); >> + >> + /* Retrieve the number of vcpu */ >> + r->guest_vcpus =3D ms->smp.cpus; >> + >> + /* Retrieve the number of virtual sockets */ >> + r->guest_vsockets =3D ms->smp.sockets; >> + >> + /* Allocate register memory (MSR_PKG_STATUS) for each vcpu */ >> + r->msr_value =3D g_new0(uint64_t, r->guest_vcpus); >> + >> + /* Retrieve the CPUArchIDlist */ >> + r->guest_cpu_list =3D mc->possible_cpu_arch_ids(ms); >> + >> + /* Max number of cpus on the Host */ >> + r->host_topo.maxcpus =3D vmsr_get_maxcpus(); >> + if (r->host_topo.maxcpus =3D=3D 0) { >> + error_report("host max cpus =3D 0"); >> + ret =3D 1; >> + goto out; >> + } >> + >> + /* Max number of packages on the host */ >> + r->host_topo.maxpkgs =3D vmsr_get_max_physical_package(r->host_topo= .maxcpus); >> + if (r->host_topo.maxpkgs =3D=3D 0) { >> + error_report("host max pkgs =3D 0"); >> + ret =3D 1; >> + goto out; >> + } >> + >> + /* Allocate memory for each package on the host */ >> + r->host_topo.pkg_cpu_count =3D g_new0(unsigned int, r->host_topo.ma= xpkgs); >> + r->host_topo.maxticks =3D g_new0(unsigned int, r->host_topo.maxpkgs= ); >> + >> + vmsr_count_cpus_per_package(r->host_topo.pkg_cpu_count, >> + r->host_topo.maxpkgs); >> + for (int i =3D 0; i < r->host_topo.maxpkgs; i++) { >> + if (r->host_topo.pkg_cpu_count[i] =3D=3D 0) { >> + error_report("cpu per packages =3D 0 on package_%d", i); >> + ret =3D 1; >> + goto out; >> + } >> + } >> + >> + /* Get QEMU PID*/ >> + r->pid =3D getpid(); >> + >> + /* Compute the socket path if necessary */ >> + if (s->msr_energy.socket_path =3D=3D NULL) { >> + s->msr_energy.socket_path =3D vmsr_compute_default_paths(); >> + } >> + >> + /* Open socket with vmsr helper */ >> + s->msr_energy.sioc =3D vmsr_open_socket(s->msr_energy.socket_path); >> + >> + if (s->msr_energy.sioc =3D=3D NULL) { >> + error_report("vmsr socket opening failed"); >> + ret =3D 1; >> + goto out; >> + } >> + >> + /* Those MSR values should not change */ >> + r->msr_unit =3D vmsr_read_msr(MSR_RAPL_POWER_UNIT, 0, r->pid, >> + s->msr_energy.sioc); >> + r->msr_limit =3D vmsr_read_msr(MSR_PKG_POWER_LIMIT, 0, r->pid, >> + s->msr_energy.sioc); >> + r->msr_info =3D vmsr_read_msr(MSR_PKG_POWER_INFO, 0, r->pid, >> + s->msr_energy.sioc); >> + if (r->msr_unit =3D=3D 0 || r->msr_limit =3D=3D 0 || r->msr_info = =3D=3D 0) { >> + error_report("can't read any virtual msr"); >> + ret =3D 1; >> + goto out; >> + } >> + >> + qemu_thread_create(&r->msr_thr, "kvm-msr", >> + kvm_msr_energy_thread, >> + s, QEMU_THREAD_JOINABLE); >> +out: >> + return ret; >> +} >> + >> int kvm_arch_get_default_type(MachineState *ms) >> { >> return 0; >> @@ -2768,6 +3154,49 @@ int kvm_arch_init(MachineState *ms, KVMState *s) >> strerror(-ret)); >> exit(1); >> } >> + >> + if (s->msr_energy.enable =3D=3D true) { >> + r =3D kvm_filter_msr(s, MSR_RAPL_POWER_UNIT, >> + kvm_rdmsr_rapl_power_unit, NULL); >> + if (!r) { >> + error_report("Could not install MSR_RAPL_POWER_UNIT \ >> + handler: %s", >> + strerror(-ret)); >> + exit(1); >> + } >> + >> + r =3D kvm_filter_msr(s, MSR_PKG_POWER_LIMIT, >> + kvm_rdmsr_pkg_power_limit, NULL); >> + if (!r) { >> + error_report("Could not install MSR_PKG_POWER_LIMIT \ >> + handler: %s", >> + strerror(-ret)); >> + exit(1); >> + } >> + >> + r =3D kvm_filter_msr(s, MSR_PKG_POWER_INFO, >> + kvm_rdmsr_pkg_power_info, NULL); >> + if (!r) { >> + error_report("Could not install MSR_PKG_POWER_INFO \ >> + handler: %s", >> + strerror(-ret)); >> + exit(1); >> + } >> + r =3D kvm_filter_msr(s, MSR_PKG_ENERGY_STATUS, >> + kvm_rdmsr_pkg_energy_status, NULL); >> + if (!r) { >> + error_report("Could not install MSR_PKG_ENERGY_STATUS \ >> + handler: %s", >> + strerror(-ret)); >> + exit(1); >> + } >> + r =3D kvm_msr_energy_thread_init(s, ms); >> + if (r) { >> + error_report("kvm : error RAPL feature requirement not = meet"); >> + exit(1); >> + } >> + >> + } >> } >> =20 >> return 0; >> diff --git a/target/i386/kvm/meson.build b/target/i386/kvm/meson.build >> index e7850981e62d..3996cafaf29f 100644 >> --- a/target/i386/kvm/meson.build >> +++ b/target/i386/kvm/meson.build >> @@ -3,6 +3,7 @@ i386_kvm_ss =3D ss.source_set() >> i386_kvm_ss.add(files( >> 'kvm.c', >> 'kvm-cpu.c', >> + 'vmsr_energy.c', >> )) >> =20 >> i386_kvm_ss.add(when: 'CONFIG_XEN_EMU', if_true: files('xen-emu.c')) >> diff --git a/target/i386/kvm/vmsr_energy.c b/target/i386/kvm/vmsr_energy= .c >> new file mode 100644 >> index 000000000000..acf0fc0a2fb3 >> --- /dev/null >> +++ b/target/i386/kvm/vmsr_energy.c >> @@ -0,0 +1,344 @@ >> +/* >> + * QEMU KVM support -- x86 virtual RAPL msr >> + * >> + * Copyright 2024 Red Hat, Inc. 2024 >> + * >> + * Author: >> + * Anthony Harivel >> + * >> + * This work is licensed under the terms of the GNU GPL, version 2 or l= ater. >> + * See the COPYING file in the top-level directory. >> + * >> + */ >> + >> +#include "qemu/osdep.h" >> +#include "qemu/error-report.h" >> +#include "vmsr_energy.h" >> +#include "io/channel.h" >> +#include "io/channel-socket.h" >> +#include "hw/boards.h" >> +#include "cpu.h" >> +#include "host-cpu.h" >> + >> +char *vmsr_compute_default_paths(void) >> +{ >> + g_autofree char *state =3D qemu_get_local_state_dir(); >> + >> + return g_build_filename(state, "run", "qemu-vmsr-helper.sock", NULL= ); >> +} >> + >> +bool is_host_cpu_intel(void) >> +{ >> + int family, model, stepping; >> + char vendor[CPUID_VENDOR_SZ + 1]; >> + >> + host_cpu_vendor_fms(vendor, &family, &model, &stepping); >> + >> + return strcmp(vendor, CPUID_VENDOR_INTEL); >> +} >> + >> +int is_rapl_enabled(void) >> +{ >> + const char *path =3D "/sys/class/powercap/intel-rapl/enabled"; >> + FILE *file =3D fopen(path, "r"); >> + int value =3D 0; >> + >> + if (file !=3D NULL) { >> + if (fscanf(file, "%d", &value) !=3D 1) { >> + error_report("INTEL RAPL not enabled"); >> + } >> + fclose(file); >> + } else { >> + error_report("Error opening %s", path); >> + } >> + >> + return value; >> +} >> + >> +QIOChannelSocket *vmsr_open_socket(const char *path) >> +{ >> + g_autofree char *socket_path =3D NULL; >> + >> + socket_path =3D g_strdup(path); >> + >> + SocketAddress saddr =3D { >> + .type =3D SOCKET_ADDRESS_TYPE_UNIX, >> + .u.q_unix.path =3D socket_path >> + }; >> + >> + QIOChannelSocket *sioc =3D qio_channel_socket_new(); >> + Error *local_err =3D NULL; >> + >> + qio_channel_set_name(QIO_CHANNEL(sioc), "vmsr-helper"); >> + qio_channel_socket_connect_sync(sioc, >> + &saddr, >> + &local_err); >> + if (local_err) { >> + /* Close socket. */ >> + qio_channel_close(QIO_CHANNEL(sioc), NULL); >> + object_unref(OBJECT(sioc)); >> + sioc =3D NULL; >> + goto out; >> + } >> + >> + qio_channel_set_delay(QIO_CHANNEL(sioc), false); >> +out: >> + return sioc; >> +} >> + >> +uint64_t vmsr_read_msr(uint32_t reg, uint32_t cpu_id, uint32_t tid, >> + QIOChannelSocket *sioc) >> +{ >> + uint64_t data =3D 0; >> + int r =3D 0; >> + Error *local_err =3D NULL; >> + uint32_t buffer[3]; >> + /* >> + * Send the required arguments: >> + * 1. RAPL MSR register to read >> + * 2. On which CPU ID >> + * 3. From which vCPU (Thread ID) >> + */ >> + buffer[0] =3D reg; >> + buffer[1] =3D cpu_id; >> + buffer[2] =3D tid; >> + >> + r =3D qio_channel_write_all(QIO_CHANNEL(sioc), >> + (char *)buffer, sizeof(buffer), >> + &local_err); >> + if (r < 0) { >> + goto out_close; >> + } >> + >> + r =3D qio_channel_read(QIO_CHANNEL(sioc), >> + (char *)&data, sizeof(data), >> + &local_err); >> + if (r < 0) { >> + data =3D 0; >> + goto out_close; >> + } >> + >> +out_close: >> + return data; >> +} >> + >> +/* Retrieve the max number of physical package */ >> +unsigned int vmsr_get_max_physical_package(unsigned int max_cpus) >> +{ >> + const char *dir =3D "/sys/devices/system/cpu/"; >> + const char *topo_path =3D "topology/physical_package_id"; >> + g_autofree int *uniquePackages =3D g_new0(int, max_cpus); >> + unsigned int packageCount =3D 0; >> + FILE *file =3D NULL; >> + >> + for (int i =3D 0; i < max_cpus; i++) { >> + g_autofree char *filePath =3D NULL; >> + g_autofree char *cpuid =3D g_strdup_printf("cpu%d", i); >> + >> + filePath =3D g_build_filename(dir, cpuid, topo_path, NULL); >> + >> + file =3D fopen(filePath, "r"); >> + >> + if (file =3D=3D NULL) { >> + error_report("Error opening physical_package_id file"); >> + return 0; >> + } >> + >> + char packageId[10]; >> + if (fgets(packageId, sizeof(packageId), file) =3D=3D NULL) { >> + packageCount =3D 0; >> + } >> + >> + fclose(file); >> + >> + int currentPackageId =3D atoi(packageId); >> + >> + bool isUnique =3D true; >> + for (int j =3D 0; j < packageCount; j++) { >> + if (uniquePackages[j] =3D=3D currentPackageId) { >> + isUnique =3D false; >> + break; >> + } >> + } >> + >> + if (isUnique) { >> + uniquePackages[packageCount] =3D currentPackageId; >> + packageCount++; >> + >> + if (packageCount >=3D max_cpus) { >> + break; >> + } >> + } >> + } >> + >> + return (packageCount =3D=3D 0) ? 1 : packageCount; >> +} >> + >> +/* Retrieve the max number of physical cpu on the host */ >> +unsigned int vmsr_get_maxcpus(void) >> +{ >> + GDir *dir; >> + const gchar *entry_name; >> + unsigned int cpu_count =3D 0; >> + const char *path =3D "/sys/devices/system/cpu/"; >> + >> + dir =3D g_dir_open(path, 0, NULL); >> + if (dir =3D=3D NULL) { >> + error_report("Unable to open cpu directory"); >> + return -1; >> + } >> + >> + while ((entry_name =3D g_dir_read_name(dir)) !=3D NULL) { >> + if (g_ascii_strncasecmp(entry_name, "cpu", 3) =3D=3D 0 && >> + isdigit(entry_name[3])) { >> + cpu_count++; >> + } >> + } >> + >> + g_dir_close(dir); >> + >> + return cpu_count; >> +} >> + >> +/* Count the number of physical cpu on each packages */ >> +unsigned int vmsr_count_cpus_per_package(unsigned int *package_count, >> + unsigned int max_pkgs) >> +{ >> + g_autofree char *file_contents =3D NULL; >> + g_autofree char *path =3D NULL; >> + g_autofree char *path_name =3D NULL; >> + gsize length; >> + >> + /* Iterate over cpus and count cpus in each package */ >> + for (int cpu_id =3D 0; ; cpu_id++) { >> + path_name =3D g_strdup_printf("/sys/devices/system/cpu/cpu%d/" >> + "topology/physical_package_id", cpu_id); >> + >> + path =3D g_build_filename(path_name, NULL); >> + >> + if (!g_file_get_contents(path, &file_contents, &length, NULL)) = { >> + break; /* No more cpus */ >> + } >> + >> + /* Get the physical package ID for this CPU */ >> + int package_id =3D atoi(file_contents); >> + >> + /* Check if the package ID is within the known number of packag= es */ >> + if (package_id >=3D 0 && package_id < max_pkgs) { >> + /* If yes, count the cpu for this package*/ >> + package_count[package_id]++; >> + } >> + } >> + >> + return 0; >> +} >> + >> +/* Get the physical package id from a given cpu id */ >> +int vmsr_get_physical_package_id(int cpu_id) >> +{ >> + g_autofree char *file_contents =3D NULL; >> + g_autofree char *file_path =3D NULL; >> + int package_id =3D -1; >> + gsize length; >> + >> + file_path =3D g_strdup_printf("/sys/devices/system/cpu/cpu%d" >> + "/topology/physical_package_id", cpu_id); >> + >> + if (!g_file_get_contents(file_path, &file_contents, &length, NULL))= { >> + goto out; >> + } >> + >> + package_id =3D atoi(file_contents); >> + >> +out: >> + return package_id; >> +} >> + >> +/* Read the scheduled time for a given thread of a give pid */ >> +void vmsr_read_thread_stat(pid_t pid, >> + unsigned int thread_id, >> + unsigned long long *utime, >> + unsigned long long *stime, >> + unsigned int *cpu_id) >> +{ >> + g_autofree char *path =3D NULL; >> + g_autofree char *path_name =3D NULL; >> + >> + path_name =3D g_strdup_printf("/proc/%u/task/%d/stat", pid, thread_= id); >> + >> + path =3D g_build_filename(path_name, NULL); >> + >> + FILE *file =3D fopen(path, "r"); >> + if (file =3D=3D NULL) { >> + pid =3D -1; >> + return; >> + } >> + >> + if (fscanf(file, "%*d (%*[^)]) %*c %*d %*d %*d %*d %*d %*u %*u %*u = %*u %*u" >> + " %llu %llu %*d %*d %*d %*d %*d %*d %*u %*u %*d %*u %*u" >> + " %*u %*u %*u %*u %*u %*u %*u %*u %*u %*d %*u %*u %u", >> + utime, stime, cpu_id) !=3D 3) >> + { >> + pid =3D -1; >> + return; >> + } >> + >> + fclose(file); >> + return; >> +} >> + >> +/* Read QEMU stat task folder to retrieve all QEMU threads ID */ >> +pid_t *vmsr_get_thread_ids(pid_t pid, unsigned int *num_threads) >> +{ >> + g_autofree char *task_path =3D g_strdup_printf("%d/task", pid); >> + g_autofree char *path =3D g_build_filename("/proc", task_path, NULL= ); >> + >> + DIR *dir =3D opendir(path); >> + if (dir =3D=3D NULL) { >> + error_report("Error opening /proc/qemu/task"); >> + return NULL; >> + } >> + >> + pid_t *thread_ids =3D NULL; >> + unsigned int thread_count =3D 0; >> + >> + g_autofree struct dirent *ent =3D NULL; >> + while ((ent =3D readdir(dir)) !=3D NULL) { >> + if (ent->d_name[0] =3D=3D '.') { >> + continue; >> + } >> + pid_t tid =3D atoi(ent->d_name); >> + if (pid !=3D tid) { >> + thread_ids =3D g_renew(pid_t, thread_ids, (thread_count + 1= )); >> + thread_ids[thread_count] =3D tid; >> + thread_count++; >> + } >> + } >> + >> + closedir(dir); >> + >> + *num_threads =3D thread_count; >> + return thread_ids; >> +} >> + >> +void vmsr_delta_ticks(vmsr_thread_stat *thd_stat, int i) >> +{ >> + thd_stat[i].delta_ticks =3D (thd_stat[i].utime[1] + thd_stat[i].sti= me[1]) >> + - (thd_stat[i].utime[0] + thd_stat[i].stime= [0]); >> +} >> + >> +double vmsr_get_ratio(uint64_t e_delta, >> + unsigned long long delta_ticks, >> + unsigned int maxticks) >> +{ >> + return (e_delta / 100.0) * ((100.0 / maxticks) * delta_ticks); >> +} >> + >> +void vmsr_init_topo_info(X86CPUTopoInfo *topo_info, >> + const MachineState *ms) >> +{ >> + topo_info->dies_per_pkg =3D ms->smp.dies; >> + topo_info->cores_per_die =3D ms->smp.cores; >> + topo_info->threads_per_core =3D ms->smp.threads; >> +} >> + >> diff --git a/target/i386/kvm/vmsr_energy.h b/target/i386/kvm/vmsr_energy= .h >> new file mode 100644 >> index 000000000000..16cc1f4814f6 >> --- /dev/null >> +++ b/target/i386/kvm/vmsr_energy.h >> @@ -0,0 +1,99 @@ >> +/* >> + * QEMU KVM support -- x86 virtual energy-related MSR. >> + * >> + * Copyright 2024 Red Hat, Inc. 2024 >> + * >> + * Author: >> + * Anthony Harivel >> + * >> + * This work is licensed under the terms of the GNU GPL, version 2 or l= ater. >> + * See the COPYING file in the top-level directory. >> + * >> + */ >> + >> +#ifndef VMSR_ENERGY_H >> +#define VMSR_ENERGY_H >> + >> +#include >> +#include "qemu/osdep.h" >> +#include "io/channel-socket.h" >> +#include "hw/i386/topology.h" >> + >> +/* >> + * Define the interval time in micro seconds between 2 samples of >> + * energy related MSRs >> + */ >> +#define MSR_ENERGY_THREAD_SLEEP_US 1000000.0 >> + >> +/* >> + * Thread statistic >> + * @ thread_id: TID (thread ID) >> + * @ is_vcpu: true if TID is vCPU thread >> + * @ cpu_id: CPU number last executed on >> + * @ pkg_id: package number of the CPU >> + * @ vcpu_id: vCPU ID >> + * @ vpkg: virtual package number >> + * @ acpi_id: APIC id of the vCPU >> + * @ utime: amount of clock ticks the thread >> + * has been scheduled in User mode >> + * @ stime: amount of clock ticks the thread >> + * has been scheduled in System mode >> + * @ delta_ticks: delta of utime+stime between >> + * the two samples (before/after sleep) >> + */ >> +struct vmsr_thread_stat { >> + unsigned int thread_id; >> + bool is_vcpu; >> + unsigned int cpu_id; >> + unsigned int pkg_id; >> + unsigned int vpkg_id; >> + unsigned int vcpu_id; >> + unsigned long acpi_id; >> + unsigned long long *utime; >> + unsigned long long *stime; >> + unsigned long long delta_ticks; >> +}; >> + >> +/* >> + * Package statistic >> + * @ e_start: package energy counter before the sleep >> + * @ e_end: package energy counter after the sleep >> + * @ e_delta: delta of package energy counter >> + * @ e_ratio: store the energy ratio of non-vCPU thread >> + * @ nb_vcpu: number of vCPU running on this package >> + */ >> +struct vmsr_package_energy_stat { >> + uint64_t e_start; >> + uint64_t e_end; >> + uint64_t e_delta; >> + uint64_t e_ratio; >> + unsigned int nb_vcpu; >> +}; >> + >> +typedef struct vmsr_thread_stat vmsr_thread_stat; >> +typedef struct vmsr_package_energy_stat vmsr_package_energy_stat; >> + >> +char *vmsr_compute_default_paths(void); >> +void vmsr_read_thread_stat(pid_t pid, >> + unsigned int thread_id, >> + unsigned long long *utime, >> + unsigned long long *stime, >> + unsigned int *cpu_id); >> + >> +QIOChannelSocket *vmsr_open_socket(const char *path); >> +uint64_t vmsr_read_msr(uint32_t reg, uint32_t cpu_id, >> + uint32_t tid, QIOChannelSocket *sioc); >> +void vmsr_delta_ticks(vmsr_thread_stat *thd_stat, int i); >> +unsigned int vmsr_get_maxcpus(void); >> +unsigned int vmsr_get_max_physical_package(unsigned int max_cpus); >> +unsigned int vmsr_count_cpus_per_package(unsigned int *package_count, >> + unsigned int max_pkgs); >> +int vmsr_get_physical_package_id(int cpu_id); >> +pid_t *vmsr_get_thread_ids(pid_t pid, unsigned int *num_threads); >> +double vmsr_get_ratio(uint64_t e_delta, >> + unsigned long long delta_ticks, >> + unsigned int maxticks); >> +void vmsr_init_topo_info(X86CPUTopoInfo *topo_info, const MachineState = *ms); >> +bool is_host_cpu_intel(void); >> +int is_rapl_enabled(void); >> +#endif /* VMSR_ENERGY_H */