[PATCH v6 0/3] Add support for the RAPL MSRs series

* [PATCH v6 0/3] Add support for the RAPL MSRs series
@ 2024-05-22 15:34 Anthony Harivel
  2024-05-22 15:34 ` [PATCH v6 1/3] qio: add support for SO_PEERCRED for socket channel Anthony Harivel
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Anthony Harivel @ 2024-05-22 15:34 UTC (permalink / raw)
  To: pbonzini, mtosatti, berrange
  Cc: qemu-devel, vchundur, rjarry, Anthony Harivel

Dear maintainers, 

First of all, thank you very much for your review of my patch 
[1].

In this version (v6), I have attempted to address all the problems
raised by Daniel and Paolo during the last review.

However, two open questions remain unanswered and would require the
attention of an x86 maintainer:

1) Should I move the rapl feature from -kvm to -cpu? [2]

2) Should I already rename to "rapl_vmsr_*" in order to anticipate the
   future TMPI architecture? [end of 3]

Thank you again for your continued guidance. 

v5 -> v6
--------
- Better error consistency in qio_channel_get_peerpid()
- Memory leak g_strdup_printf/g_build_filename corrected
- Renamed several structs with "vmsr_*" for better namespacing
- Renamed several structs with "guest_*" for better comprehension
- Optimization suggested by Daniel
- Crash problem solved [4]

v4 -> v5
--------

- correct qio_channel_get_peerpid: return pid = -1 in case of error
- Vmsr_helper: compile only for x86
- Vmsr_helper: use qio_channel_read/write_all
- Vmsr_helper: abandon user/group
- Vmsr_energy.c: correct all error_report
- Vmsr thread: compute default socket path only once
- Vmsr thread: open socket only once
- Pass relevant QEMU CI

v3 -> v4
--------

- Correct memory leaks with AddressSanitizer  
- Add sanity checks in QEMU and qemu-vmsr-helper to verify that the host
  is Intel and that RAPL is activated.
- Rename poorly named variables for easier comprehension
- Move code that checks Host before creating the VMSR thread
- Get rid of libnuma: create a function that reads sysfs to obtain the
  Host topology instead

v2 -> v3
--------

- Move all memory allocations from Clib to Glib
- Compile on *BSD (working on Linux only)
- No more limitation on the virtual package: each vCPU that belongs to
  the same virtual package gives the same results, as expected on
  a real CPU.
  This has been tested with topologies like:
     -smp 4,sockets=2
     -smp 16,sockets=4,cores=2,threads=2

v1 -> v2
--------

- To overcome CVE-2020-8694, a socket communication is created
  to a privileged helper
- Add the privileged helper (qemu-vmsr-helper)
- Add SO_PEERCRED in qio channel socket

RFC -> v1
---------

- Add vmsr_* in front of all vmsr-specific functions
- Replace malloc()/calloc()... with their glib equivalents
- Pre-allocate all dynamic memory when possible
- Add documentation of implementation, limitations and usage

Best regards,
Anthony

[1]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg01570.html
[2]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg03947.html
[3]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg02350.html
[4]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg02481.html

Anthony Harivel (3):
  qio: add support for SO_PEERCRED for socket channel
  tools: build qemu-vmsr-helper
  Add support for RAPL MSRs in KVM/Qemu

 accel/kvm/kvm-all.c                      |  27 ++
 contrib/systemd/qemu-vmsr-helper.service |  15 +
 contrib/systemd/qemu-vmsr-helper.socket  |   9 +
 docs/specs/index.rst                     |   1 +
 docs/specs/rapl-msr.rst                  | 155 +++++++
 docs/tools/index.rst                     |   1 +
 docs/tools/qemu-vmsr-helper.rst          |  89 ++++
 include/io/channel.h                     |  21 +
 include/sysemu/kvm_int.h                 |  32 ++
 io/channel-socket.c                      |  28 ++
 io/channel.c                             |  13 +
 meson.build                              |   7 +
 target/i386/cpu.h                        |   8 +
 target/i386/kvm/kvm.c                    | 431 +++++++++++++++++-
 target/i386/kvm/meson.build              |   1 +
 target/i386/kvm/vmsr_energy.c            | 337 ++++++++++++++
 target/i386/kvm/vmsr_energy.h            |  99 +++++
 tools/i386/qemu-vmsr-helper.c            | 530 +++++++++++++++++++++++
 tools/i386/rapl-msr-index.h              |  28 ++
 19 files changed, 1831 insertions(+), 1 deletion(-)
 create mode 100644 contrib/systemd/qemu-vmsr-helper.service
 create mode 100644 contrib/systemd/qemu-vmsr-helper.socket
 create mode 100644 docs/specs/rapl-msr.rst
 create mode 100644 docs/tools/qemu-vmsr-helper.rst
 create mode 100644 target/i386/kvm/vmsr_energy.c
 create mode 100644 target/i386/kvm/vmsr_energy.h
 create mode 100644 tools/i386/qemu-vmsr-helper.c
 create mode 100644 tools/i386/rapl-msr-index.h

-- 
2.45.1




* [PATCH v6 1/3] qio: add support for SO_PEERCRED for socket channel
  2024-05-22 15:34 [PATCH v6 0/3] Add support for the RAPL MSRs series Anthony Harivel
@ 2024-05-22 15:34 ` Anthony Harivel
  2024-05-22 15:34 ` [PATCH v6 2/3] tools: build qemu-vmsr-helper Anthony Harivel
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 25+ messages in thread
From: Anthony Harivel @ 2024-05-22 15:34 UTC (permalink / raw)
  To: pbonzini, mtosatti, berrange
  Cc: qemu-devel, vchundur, rjarry, Anthony Harivel

The function qio_channel_get_peerpid() returns the PID of the peer
process connected to this socket.

The PID is read from the peer credentials structure, which is defined
in <sys/socket.h> as follows:

struct ucred {
	pid_t pid;    /* Process ID of the sending process */
	uid_t uid;    /* User ID of the sending process */
	gid_t gid;    /* Group ID of the sending process */
};

The use of this function is possible only for connected AF_UNIX stream
sockets and for AF_UNIX stream and datagram socket pairs.

On platforms other than Linux, the function returns -1 and sets the pid
to -1.
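
For readers unfamiliar with SO_PEERCRED, here is a minimal standalone
sketch of the underlying system call, outside of the QIOChannel layer.
The names peer_pid_of() and peer_pid_of_own_socketpair() are hypothetical
and exist only for this illustration; they are not part of the patch.
It assumes Linux with glibc, where _GNU_SOURCE is needed to expose
struct ucred.

```c
#define _GNU_SOURCE             /* makes glibc expose struct ucred */
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): return the PID of the process
 * on the other end of a connected AF_UNIX socket, or -1 on error.
 */
static pid_t peer_pid_of(int fd)
{
    struct ucred cred;
    socklen_t len = sizeof(cred);

    if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) == -1) {
        return -1;
    }
    /* The kernel fills in the whole credentials structure. */
    assert(len == sizeof(cred));
    return cred.pid;
}

/*
 * Create a socket pair inside this process and ask for the peer of one
 * end. For socketpair(), the credentials are those of the creator.
 */
static pid_t peer_pid_of_own_socketpair(void)
{
    int sv[2];
    pid_t pid;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) {
        return -1;
    }
    pid = peer_pid_of(sv[0]);
    close(sv[0]);
    close(sv[1]);
    return pid;
}
```

Calling peer_pid_of_own_socketpair() from a process should give back that
process's own PID, since both ends of the pair were created by it. In the
helper use case, the fd would instead be the accepted client connection.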

Signed-off-by: Anthony Harivel <aharivel@redhat.com>
---
 include/io/channel.h | 21 +++++++++++++++++++++
 io/channel-socket.c  | 28 ++++++++++++++++++++++++++++
 io/channel.c         | 13 +++++++++++++
 3 files changed, 62 insertions(+)

diff --git a/include/io/channel.h b/include/io/channel.h
index 7986c49c713a..bdf0bca92ae2 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -160,6 +160,9 @@ struct QIOChannelClass {
                                   void *opaque);
     int (*io_flush)(QIOChannel *ioc,
                     Error **errp);
+    int (*io_peerpid)(QIOChannel *ioc,
+                       unsigned int *pid,
+                       Error **errp);
 };
 
 /* General I/O handling functions */
@@ -981,4 +984,22 @@ int coroutine_mixed_fn qio_channel_writev_full_all(QIOChannel *ioc,
 int qio_channel_flush(QIOChannel *ioc,
                       Error **errp);
 
+/**
+ * qio_channel_get_peerpid:
+ * @ioc: the channel object
+ * @pid: pointer to pid
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Returns the pid of the peer process connected to this socket.
+ *
+ * The use of this function is possible only for connected
+ * AF_UNIX stream sockets and for AF_UNIX stream and datagram
+ * socket pairs on Linux.
+ * Returns 0 on success, -1 on error; on non-Linux OSes it fails with pid -1.
+ *
+ */
+int qio_channel_get_peerpid(QIOChannel *ioc,
+                             unsigned int *pid,
+                             Error **errp);
+
 #endif /* QIO_CHANNEL_H */
diff --git a/io/channel-socket.c b/io/channel-socket.c
index 3a899b060858..608bcf066ecd 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -841,6 +841,33 @@ qio_channel_socket_set_cork(QIOChannel *ioc,
     socket_set_cork(sioc->fd, v);
 }
 
+static int
+qio_channel_socket_get_peerpid(QIOChannel *ioc,
+                               unsigned int *pid,
+                               Error **errp)
+{
+#ifdef CONFIG_LINUX
+    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+    Error *err = NULL;
+    socklen_t len = sizeof(struct ucred);
+
+    struct ucred cred;
+    if (getsockopt(sioc->fd,
+               SOL_SOCKET, SO_PEERCRED,
+               &cred, &len) == -1) {
+        error_setg_errno(&err, errno, "Unable to get peer credentials");
+        error_propagate(errp, err);
+        *pid = -1;
+        return -1;
+    }
+    *pid = (unsigned int)cred.pid;
+    return 0;
+#else
+    error_setg(errp, "Unsupported feature");
+    *pid = -1;
+    return -1;
+#endif
+}
 
 static int
 qio_channel_socket_close(QIOChannel *ioc,
@@ -938,6 +965,7 @@ static void qio_channel_socket_class_init(ObjectClass *klass,
 #ifdef QEMU_MSG_ZEROCOPY
     ioc_klass->io_flush = qio_channel_socket_flush;
 #endif
+    ioc_klass->io_peerpid = qio_channel_socket_get_peerpid;
 }
 
 static const TypeInfo qio_channel_socket_info = {
diff --git a/io/channel.c b/io/channel.c
index a1f12f8e9096..e3f17c24a00f 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -548,6 +548,19 @@ void qio_channel_set_cork(QIOChannel *ioc,
     }
 }
 
+int qio_channel_get_peerpid(QIOChannel *ioc,
+                             unsigned int *pid,
+                             Error **errp)
+{
+    QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+    if (!klass->io_peerpid) {
+        error_setg(errp, "Channel does not support peer pid");
+        return -1;
+    }
+    /* Propagate the implementation's result so that errors are not masked */
+    return klass->io_peerpid(ioc, pid, errp);
+}
 
 off_t qio_channel_io_seek(QIOChannel *ioc,
                           off_t offset,
-- 
2.45.1




* [PATCH v6 2/3] tools: build qemu-vmsr-helper
  2024-05-22 15:34 [PATCH v6 0/3] Add support for the RAPL MSRs series Anthony Harivel
  2024-05-22 15:34 ` [PATCH v6 1/3] qio: add support for SO_PEERCRED for socket channel Anthony Harivel
@ 2024-05-22 15:34 ` Anthony Harivel
  2024-05-22 15:34 ` [PATCH v6 3/3] Add support for RAPL MSRs in KVM/Qemu Anthony Harivel
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 25+ messages in thread
From: Anthony Harivel @ 2024-05-22 15:34 UTC (permalink / raw)
  To: pbonzini, mtosatti, berrange
  Cc: qemu-devel, vchundur, rjarry, Anthony Harivel

Introduce a privileged helper to access RAPL MSR.

The privileged helper tool, qemu-vmsr-helper, is designed to provide
virtual machines with the ability to read specific RAPL (Running Average
Power Limit) MSRs without requiring CAP_SYS_RAWIO privileges or relying
on external, out-of-tree patches.

The helper tool leverages Unix permissions and SO_PEERCRED socket
options to enforce access control, ensuring that only processes
explicitly requesting read access via readmsr() from a valid Thread ID
can access these MSRs.
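
As a rough illustration of what "a valid Thread ID" means here, the sketch
below shows how a client could fill the three-word request that the helper
reads (MSR index, host CPU id, caller TID), together with a check in the
spirit of the helper's /proc lookup. The names vmsr_fill_request(),
tid_belongs_to_pid() and the REQ_* indices are hypothetical and used only
for this example; the real client side lives in patch 3/3.

```c
#define _GNU_SOURCE             /* for syscall() */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Request layout read by the helper: { MSR index, host CPU id, TID } */
enum { REQ_MSR, REQ_CPU, REQ_TID, REQ_WORDS };

/* Hypothetical client-side helper: fill a request for the calling thread. */
static void vmsr_fill_request(uint32_t req[REQ_WORDS],
                              uint32_t msr, uint32_t cpu_id)
{
    req[REQ_MSR] = msr;
    req[REQ_CPU] = cpu_id;
    req[REQ_TID] = (uint32_t)syscall(SYS_gettid);
}

/*
 * Same idea as the helper's check: a TID is accepted only if it shows up
 * as a task of the peer PID obtained through SO_PEERCRED.
 */
static bool tid_belongs_to_pid(pid_t pid, pid_t tid)
{
    char path[64];

    snprintf(path, sizeof(path), "/proc/%d/task/%d", (int)pid, (int)tid);
    return access(path, F_OK) == 0;
}
```

A request built by a thread of the connected process passes
tid_belongs_to_pid(), while a TID taken from an unrelated process does
not, which is what ties each MSR read to a thread of the peer.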

The list of RAPL MSRs that are allowed to be read by the helper tool is
defined in rapl-msr-index.h. This list corresponds to the RAPL MSRs that
will be supported in the next commit titled "Add support for RAPL MSRs
in KVM/QEMU."

The tool is intentionally designed to run on the Linux x86 platform.
This initial implementation is tailored for Intel CPUs but can be
extended to support AMD CPUs in the future.

Signed-off-by: Anthony Harivel <aharivel@redhat.com>
---
 contrib/systemd/qemu-vmsr-helper.service |  15 +
 contrib/systemd/qemu-vmsr-helper.socket  |   9 +
 docs/tools/index.rst                     |   1 +
 docs/tools/qemu-vmsr-helper.rst          |  89 ++++
 meson.build                              |   7 +
 tools/i386/qemu-vmsr-helper.c            | 530 +++++++++++++++++++++++
 tools/i386/rapl-msr-index.h              |  28 ++
 7 files changed, 679 insertions(+)
 create mode 100644 contrib/systemd/qemu-vmsr-helper.service
 create mode 100644 contrib/systemd/qemu-vmsr-helper.socket
 create mode 100644 docs/tools/qemu-vmsr-helper.rst
 create mode 100644 tools/i386/qemu-vmsr-helper.c
 create mode 100644 tools/i386/rapl-msr-index.h

diff --git a/contrib/systemd/qemu-vmsr-helper.service b/contrib/systemd/qemu-vmsr-helper.service
new file mode 100644
index 000000000000..8fd397bf79a9
--- /dev/null
+++ b/contrib/systemd/qemu-vmsr-helper.service
@@ -0,0 +1,15 @@
+[Unit]
+Description=Virtual RAPL MSR Daemon for QEMU
+
+[Service]
+WorkingDirectory=/tmp
+Type=simple
+ExecStart=/usr/bin/qemu-vmsr-helper
+PrivateTmp=yes
+ProtectSystem=strict
+ReadWritePaths=/var/run
+RestrictAddressFamilies=AF_UNIX
+Restart=always
+RestartSec=0
+
+[Install]
diff --git a/contrib/systemd/qemu-vmsr-helper.socket b/contrib/systemd/qemu-vmsr-helper.socket
new file mode 100644
index 000000000000..183e8304d6e2
--- /dev/null
+++ b/contrib/systemd/qemu-vmsr-helper.socket
@@ -0,0 +1,9 @@
+[Unit]
+Description=Virtual RAPL MSR helper for QEMU
+
+[Socket]
+ListenStream=/run/qemu-vmsr-helper.sock
+SocketMode=0600
+
+[Install]
+WantedBy=multi-user.target
diff --git a/docs/tools/index.rst b/docs/tools/index.rst
index 8e65ce0dfc7b..33ad438e86f6 100644
--- a/docs/tools/index.rst
+++ b/docs/tools/index.rst
@@ -16,3 +16,4 @@ command line utilities and other standalone programs.
    qemu-pr-helper
    qemu-trace-stap
    virtfs-proxy-helper
+   qemu-vmsr-helper
diff --git a/docs/tools/qemu-vmsr-helper.rst b/docs/tools/qemu-vmsr-helper.rst
new file mode 100644
index 000000000000..6ec87b49d962
--- /dev/null
+++ b/docs/tools/qemu-vmsr-helper.rst
@@ -0,0 +1,89 @@
+==================================
+QEMU virtual RAPL MSR helper
+==================================
+
+Synopsis
+--------
+
+**qemu-vmsr-helper** [*OPTION*]
+
+Description
+-----------
+
+Implements the virtual RAPL MSR helper for QEMU.
+
+Accessing the RAPL (Running Average Power Limit) MSR enables the RAPL powercap
+driver to advertise and monitor the power consumption or accumulated energy
+consumption of different power domains, such as CPU packages, DRAM, and other
+components when available.
+
+However, those registers are accessible only with privileges (CAP_SYS_RAWIO).
+QEMU can use an external helper to access those privileged registers.
+
+:program:`qemu-vmsr-helper` is that external helper; it creates a listener
+socket which will accept incoming connections for communication with QEMU.
+
+If you want to run VMs in a setup like this, this helper should be started as a
+system service, and you should read the QEMU manual section on "RAPL MSR
+support" to find out how to configure QEMU to connect to the socket created by
+:program:`qemu-vmsr-helper`.
+
+After connecting to the socket, :program:`qemu-vmsr-helper` can
+optionally drop root privileges, except for those capabilities that
+are needed for its operation.
+
+:program:`qemu-vmsr-helper` can also use the systemd socket activation
+protocol.  In this case, the systemd socket unit should specify a
+Unix stream socket, like this::
+
+    [Socket]
+    ListenStream=/var/run/qemu-vmsr-helper.sock
+
+Options
+-------
+
+.. program:: qemu-vmsr-helper
+
+.. option:: -d, --daemon
+
+  run in the background (and create a PID file)
+
+.. option:: -q, --quiet
+
+  decrease verbosity
+
+.. option:: -v, --verbose
+
+  increase verbosity
+
+.. option:: -f, --pidfile=PATH
+
+  PID file when running as a daemon. By default the PID file
+  is created in the system runtime state directory, for example
+  :file:`/var/run/qemu-vmsr-helper.pid`.
+
+.. option:: -k, --socket=PATH
+
+  path to the socket. By default the socket is created in
+  the system runtime state directory, for example
+  :file:`/var/run/qemu-vmsr-helper.sock`.
+
+.. option:: -T, --trace [[enable=]PATTERN][,events=FILE][,file=FILE]
+
+  .. include:: ../qemu-option-trace.rst.inc
+
+.. option:: -u, --user=USER
+
+  user to drop privileges to
+
+.. option:: -g, --group=GROUP
+
+  group to drop privileges to
+
+.. option:: -h, --help
+
+  Display a help message and exit.
+
+.. option:: -V, --version
+
+  Display version information and exit.
diff --git a/meson.build b/meson.build
index a9de71d45064..9947680ad0fc 100644
--- a/meson.build
+++ b/meson.build
@@ -4021,6 +4021,13 @@ if have_tools
                dependencies: [authz, crypto, io, qom, qemuutil,
                               libcap_ng, mpathpersist],
                install: true)
+
+    if cpu in ['x86', 'x86_64']
+      executable('qemu-vmsr-helper', files('tools/i386/qemu-vmsr-helper.c'),
+               dependencies: [authz, crypto, io, qom, qemuutil,
+                              libcap_ng, mpathpersist],
+               install: true)
+    endif
   endif
 
   if have_ivshmem
diff --git a/tools/i386/qemu-vmsr-helper.c b/tools/i386/qemu-vmsr-helper.c
new file mode 100644
index 000000000000..ebf562c3ff87
--- /dev/null
+++ b/tools/i386/qemu-vmsr-helper.c
@@ -0,0 +1,530 @@
+/*
+ * Privileged RAPL MSR helper commands for QEMU
+ *
+ * Copyright (C) 2024 Red Hat, Inc. <aharivel@redhat.com>
+ *
+ * Author: Anthony Harivel <aharivel@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; under version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include <getopt.h>
+#include <stdbool.h>
+#include <sys/ioctl.h>
+#ifdef CONFIG_LIBCAP_NG
+#include <cap-ng.h>
+#endif
+#include <pwd.h>
+#include <grp.h>
+
+#include "qemu/help-texts.h"
+#include "qapi/error.h"
+#include "qemu/cutils.h"
+#include "qemu/main-loop.h"
+#include "qemu/module.h"
+#include "qemu/error-report.h"
+#include "qemu/config-file.h"
+#include "qemu-version.h"
+#include "qapi/error.h"
+#include "qemu/error-report.h"
+#include "qemu/log.h"
+#include "qemu/systemd.h"
+#include "io/channel.h"
+#include "io/channel-socket.h"
+#include "trace/control.h"
+#include "qemu-version.h"
+#include "rapl-msr-index.h"
+
+#define MSR_PATH_TEMPLATE "/dev/cpu/%u/msr"
+
+static char *socket_path;
+static char *pidfile;
+static enum { RUNNING, TERMINATE, TERMINATING } state;
+static QIOChannelSocket *server_ioc;
+static int server_watch;
+static int num_active_sockets = 1;
+
+#ifdef CONFIG_LIBCAP_NG
+static int uid = -1;
+static int gid = -1;
+#endif
+
+static void compute_default_paths(void)
+{
+    g_autofree char *state = qemu_get_local_state_dir();
+
+    socket_path = g_build_filename(state, "run", "qemu-vmsr-helper.sock", NULL);
+    pidfile = g_build_filename(state, "run", "qemu-vmsr-helper.pid", NULL);
+}
+
+static int is_intel_processor(void)
+{
+    int result;
+    int ebx, ecx, edx;
+
+    /* Execute CPUID instruction with eax=0 (basic identification) */
+    asm volatile (
+        "cpuid"
+        : "=b" (ebx), "=c" (ecx), "=d" (edx)
+        : "a" (0)
+    );
+
+    /*
+     *  Check if processor is "GenuineIntel"
+     *  0x756e6547 = "Genu"
+     *  0x49656e69 = "ineI"
+     *  0x6c65746e = "ntel"
+     */
+    result = (ebx == 0x756e6547) && (edx == 0x49656e69) && (ecx == 0x6c65746e);
+
+    return result;
+}
+
+static int is_rapl_enabled(void)
+{
+    const char *path = "/sys/class/powercap/intel-rapl/enabled";
+    FILE *file = fopen(path, "r");
+    int value = 0;
+
+    if (file != NULL) {
+        if (fscanf(file, "%d", &value) != 1) {
+            error_report("INTEL RAPL not enabled");
+        }
+        fclose(file);
+    } else {
+        error_report("Error opening %s", path);
+    }
+
+    return value;
+}
+
+/*
+ * Check if the TID that requests the MSR read
+ * belongs to the peer. It should be a TID of a vCPU.
+ */
+static bool is_tid_present(pid_t pid, pid_t tid)
+{
+    g_autofree char *tidPath = g_strdup_printf("/proc/%d/task/%d", pid, tid);
+
+    /* Check if the TID directory exists within the PID directory */
+    if (access(tidPath, F_OK) == 0) {
+        return true;
+    }
+
+    error_report("Failed to open /proc at %s", tidPath);
+    return false;
+}
+
+/*
+ * Only the RAPL MSR in target/i386/cpu.h are allowed
+ */
+static bool is_msr_allowed(uint32_t reg)
+{
+    switch (reg) {
+    case MSR_RAPL_POWER_UNIT:
+    case MSR_PKG_POWER_LIMIT:
+    case MSR_PKG_ENERGY_STATUS:
+    case MSR_PKG_POWER_INFO:
+        return true;
+    default:
+        return false;
+    }
+}
+
+static uint64_t vmsr_read_msr(uint32_t msr_register, unsigned int cpu_id)
+{
+    int fd;
+    uint64_t result = 0;
+
+    g_autofree char *path = g_strdup_printf(MSR_PATH_TEMPLATE, cpu_id);
+
+    fd = open(path, O_RDONLY);
+    if (fd < 0) {
+        error_report("Failed to open MSR file at %s", path);
+        return result;
+    }
+
+    if (pread(fd, &result, sizeof(result), msr_register) != sizeof(result)) {
+        error_report("Failed to read MSR");
+        result = 0;
+    }
+
+    close(fd);
+    return result;
+}
+
+static void usage(const char *name)
+{
+    (printf) (
+"Usage: %s [OPTIONS] FILE\n"
+"Virtual RAPL MSR helper program for QEMU\n"
+"\n"
+"  -h, --help                display this help and exit\n"
+"  -V, --version             output version information and exit\n"
+"\n"
+"  -d, --daemon              run in the background\n"
+"  -f, --pidfile=PATH        PID file when running as a daemon\n"
+"                            (default '%s')\n"
+"  -k, --socket=PATH         path to the unix socket\n"
+"                            (default '%s')\n"
+"  -T, --trace [[enable=]<pattern>][,events=<file>][,file=<file>]\n"
+"                            specify tracing options\n"
+#ifdef CONFIG_LIBCAP_NG
+"  -u, --user=USER           user to drop privileges to\n"
+"  -g, --group=GROUP         group to drop privileges to\n"
+#endif
+"\n"
+QEMU_HELP_BOTTOM "\n"
+    , name, pidfile, socket_path);
+}
+
+static void version(const char *name)
+{
+    printf(
+"%s " QEMU_FULL_VERSION "\n"
+"Written by Anthony Harivel.\n"
+"\n"
+QEMU_COPYRIGHT "\n"
+"This is free software; see the source for copying conditions.  There is NO\n"
+"warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.\n"
+    , name);
+}
+
+typedef struct VMSRHelperClient {
+    QIOChannelSocket *ioc;
+    Coroutine *co;
+} VMSRHelperClient;
+
+static void coroutine_fn vh_co_entry(void *opaque)
+{
+    VMSRHelperClient *client = opaque;
+    Error *local_err = NULL;
+    unsigned int peer_pid;
+    uint32_t request[3];
+    uint64_t vmsr;
+    int r;
+
+    qio_channel_set_blocking(QIO_CHANNEL(client->ioc),
+                             false, NULL);
+
+    qio_channel_set_follow_coroutine_ctx(QIO_CHANNEL(client->ioc), true);
+
+    /*
+     * Check peer credentials
+     */
+    r = qio_channel_get_peerpid(QIO_CHANNEL(client->ioc),
+                                &peer_pid,
+                                &local_err);
+    if (r < 0) {
+        error_report_err(local_err);
+        goto out;
+    }
+
+    while (r >= 0) {
+        /*
+         * Read the requested MSR
+         * Only RAPL MSR in rapl-msr-index.h is allowed
+         */
+        r = qio_channel_read_all(QIO_CHANNEL(client->ioc),
+                                (char *) &request, sizeof(request), &local_err);
+        if (r < 0) {
+            error_report_err(local_err);
+            break;
+        }
+
+        if (!is_msr_allowed(request[0])) {
+            error_report("Requested disallowed MSR: 0x%" PRIx32, request[0]);
+            break;
+        }
+
+        vmsr = vmsr_read_msr(request[0], request[1]);
+
+        if (!is_tid_present(peer_pid, request[2])) {
+            error_report("Requested TID not in peer PID: %d %d",
+                peer_pid, request[2]);
+            vmsr = 0;
+        }
+
+        r = qio_channel_write_all(QIO_CHANNEL(client->ioc),
+                                  (char *) &vmsr,
+                                  sizeof(vmsr),
+                                  &local_err);
+        if (r < 0) {
+            error_report_err(local_err);
+            break;
+        }
+    }
+out:
+    object_unref(OBJECT(client->ioc));
+    g_free(client);
+}
+
+static gboolean accept_client(QIOChannel *ioc,
+                              GIOCondition cond,
+                              gpointer opaque)
+{
+    QIOChannelSocket *cioc;
+    VMSRHelperClient *vmsrh;
+
+    cioc = qio_channel_socket_accept(QIO_CHANNEL_SOCKET(ioc),
+                                     NULL);
+    if (!cioc) {
+        return TRUE;
+    }
+
+    vmsrh = g_new(VMSRHelperClient, 1);
+    vmsrh->ioc = cioc;
+    vmsrh->co = qemu_coroutine_create(vh_co_entry, vmsrh);
+    qemu_coroutine_enter(vmsrh->co);
+
+    return TRUE;
+}
+
+static void termsig_handler(int signum)
+{
+    qatomic_cmpxchg(&state, RUNNING, TERMINATE);
+    qemu_notify_event();
+}
+
+static void close_server_socket(void)
+{
+    assert(server_ioc);
+
+    g_source_remove(server_watch);
+    server_watch = -1;
+    object_unref(OBJECT(server_ioc));
+    num_active_sockets--;
+}
+
+#ifdef CONFIG_LIBCAP_NG
+static int drop_privileges(void)
+{
+    /* clear all capabilities */
+    capng_clear(CAPNG_SELECT_BOTH);
+
+    if (capng_update(CAPNG_ADD, CAPNG_EFFECTIVE | CAPNG_PERMITTED,
+                     CAP_SYS_RAWIO) < 0) {
+        return -1;
+    }
+
+    return 0;
+}
+#endif
+
+int main(int argc, char **argv)
+{
+    const char *sopt = "hVk:f:dT:u:g:vq";
+    struct option lopt[] = {
+        { "help", no_argument, NULL, 'h' },
+        { "version", no_argument, NULL, 'V' },
+        { "socket", required_argument, NULL, 'k' },
+        { "pidfile", required_argument, NULL, 'f' },
+        { "daemon", no_argument, NULL, 'd' },
+        { "trace", required_argument, NULL, 'T' },
+        { "verbose", no_argument, NULL, 'v' },
+        { NULL, 0, NULL, 0 }
+    };
+    int opt_ind = 0;
+    int ch;
+    Error *local_err = NULL;
+    bool daemonize = false;
+    bool pidfile_specified = false;
+    bool socket_path_specified = false;
+    unsigned socket_activation;
+
+    struct sigaction sa_sigterm;
+    memset(&sa_sigterm, 0, sizeof(sa_sigterm));
+    sa_sigterm.sa_handler = termsig_handler;
+    sigaction(SIGTERM, &sa_sigterm, NULL);
+    sigaction(SIGINT, &sa_sigterm, NULL);
+    sigaction(SIGHUP, &sa_sigterm, NULL);
+
+    signal(SIGPIPE, SIG_IGN);
+
+    error_init(argv[0]);
+    module_call_init(MODULE_INIT_TRACE);
+    module_call_init(MODULE_INIT_QOM);
+    qemu_add_opts(&qemu_trace_opts);
+    qemu_init_exec_dir(argv[0]);
+
+    compute_default_paths();
+
+    /*
+     * Sanity check
+     * 1. cpu must be Intel cpu
+     * 2. RAPL must be enabled
+     */
+    if (!is_intel_processor()) {
+        error_report("error: CPU is not an Intel CPU");
+        exit(EXIT_FAILURE);
+    }
+
+    if (!is_rapl_enabled()) {
+        error_report("error: RAPL driver not enabled");
+        exit(EXIT_FAILURE);
+    }
+
+    while ((ch = getopt_long(argc, argv, sopt, lopt, &opt_ind)) != -1) {
+        switch (ch) {
+        case 'k':
+            g_free(socket_path);
+            socket_path = g_strdup(optarg);
+            socket_path_specified = true;
+            if (socket_path[0] != '/') {
+                error_report("socket path must be absolute");
+                exit(EXIT_FAILURE);
+            }
+            break;
+        case 'f':
+            g_free(pidfile);
+            pidfile = g_strdup(optarg);
+            pidfile_specified = true;
+            break;
+#ifdef CONFIG_LIBCAP_NG
+        case 'u': {
+            unsigned long res;
+            struct passwd *userinfo = getpwnam(optarg);
+            if (userinfo) {
+                uid = userinfo->pw_uid;
+            } else if (qemu_strtoul(optarg, NULL, 10, &res) == 0 &&
+                       (uid_t)res == res) {
+                uid = res;
+            } else {
+                error_report("invalid user '%s'", optarg);
+                exit(EXIT_FAILURE);
+            }
+            break;
+        }
+        case 'g': {
+            unsigned long res;
+            struct group *groupinfo = getgrnam(optarg);
+            if (groupinfo) {
+                gid = groupinfo->gr_gid;
+            } else if (qemu_strtoul(optarg, NULL, 10, &res) == 0 &&
+                       (gid_t)res == res) {
+                gid = res;
+            } else {
+                error_report("invalid group '%s'", optarg);
+                exit(EXIT_FAILURE);
+            }
+            break;
+        }
+#else
+        case 'u':
+        case 'g':
+            error_report("-%c not supported by this %s", ch, argv[0]);
+            exit(1);
+#endif
+        case 'd':
+            daemonize = true;
+            break;
+        case 'T':
+            trace_opt_parse(optarg);
+            break;
+        case 'V':
+            version(argv[0]);
+            exit(EXIT_SUCCESS);
+            break;
+        case 'h':
+            usage(argv[0]);
+            exit(EXIT_SUCCESS);
+            break;
+        case '?':
+            error_report("Try `%s --help' for more information.", argv[0]);
+            exit(EXIT_FAILURE);
+        }
+    }
+
+    if (!trace_init_backends()) {
+        exit(EXIT_FAILURE);
+    }
+    trace_init_file();
+    qemu_set_log(LOG_TRACE, &error_fatal);
+
+    socket_activation = check_socket_activation();
+    if (socket_activation == 0) {
+        SocketAddress saddr;
+        saddr = (SocketAddress){
+            .type = SOCKET_ADDRESS_TYPE_UNIX,
+            .u.q_unix.path = socket_path,
+        };
+        server_ioc = qio_channel_socket_new();
+        if (qio_channel_socket_listen_sync(server_ioc, &saddr,
+                                           1, &local_err) < 0) {
+            object_unref(OBJECT(server_ioc));
+            error_report_err(local_err);
+            return 1;
+        }
+    } else {
+        /* Using socket activation - check user didn't use -k etc. */
+        if (socket_path_specified) {
+            error_report("Unix socket can't be set when "
+                         "using socket activation");
+            exit(EXIT_FAILURE);
+        }
+
+        /* Can only listen on a single socket.  */
+        if (socket_activation > 1) {
+            error_report("%s does not support socket activation "
+                         "with LISTEN_FDS > 1",
+                        argv[0]);
+            exit(EXIT_FAILURE);
+        }
+        server_ioc = qio_channel_socket_new_fd(FIRST_SOCKET_ACTIVATION_FD,
+                                               &local_err);
+        if (server_ioc == NULL) {
+            error_reportf_err(local_err,
+                              "Failed to use socket activation: ");
+            exit(EXIT_FAILURE);
+        }
+    }
+
+    qemu_init_main_loop(&error_fatal);
+
+    server_watch = qio_channel_add_watch(QIO_CHANNEL(server_ioc),
+                                         G_IO_IN,
+                                         accept_client,
+                                         NULL, NULL);
+
+    if (daemonize) {
+        if (daemon(0, 0) < 0) {
+            error_report("Failed to daemonize: %s", strerror(errno));
+            exit(EXIT_FAILURE);
+        }
+    }
+
+    if (daemonize || pidfile_specified) {
+        qemu_write_pidfile(pidfile, &error_fatal);
+    }
+
+#ifdef CONFIG_LIBCAP_NG
+    if (drop_privileges() < 0) {
+        error_report("Failed to drop privileges: %s", strerror(errno));
+        exit(EXIT_FAILURE);
+    }
+#endif
+
+    info_report("Listening on %s", socket_path);
+
+    state = RUNNING;
+    do {
+        main_loop_wait(false);
+        if (state == TERMINATE) {
+            state = TERMINATING;
+            close_server_socket();
+        }
+    } while (num_active_sockets > 0);
+
+    exit(EXIT_SUCCESS);
+}
diff --git a/tools/i386/rapl-msr-index.h b/tools/i386/rapl-msr-index.h
new file mode 100644
index 000000000000..9a7118639ae3
--- /dev/null
+++ b/tools/i386/rapl-msr-index.h
@@ -0,0 +1,28 @@
+/*
+ * Allowed list of MSR for Privileged RAPL MSR helper commands for QEMU
+ *
+ * Copyright (C) 2023 Red Hat, Inc. <aharivel@redhat.com>
+ *
+ * Author: Anthony Harivel <aharivel@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; under version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * Should stay in sync with the RAPL MSR
+ * in target/i386/cpu.h
+ */
+#define MSR_RAPL_POWER_UNIT             0x00000606
+#define MSR_PKG_POWER_LIMIT             0x00000610
+#define MSR_PKG_ENERGY_STATUS           0x00000611
+#define MSR_PKG_POWER_INFO              0x00000614
-- 
2.45.1




* [PATCH v6 3/3] Add support for RAPL MSRs in KVM/Qemu
  2024-05-22 15:34 [PATCH v6 0/3] Add support for the RAPL MSRs series Anthony Harivel
  2024-05-22 15:34 ` [PATCH v6 1/3] qio: add support for SO_PEERCRED for socket channel Anthony Harivel
  2024-05-22 15:34 ` [PATCH v6 2/3] tools: build qemu-vmsr-helper Anthony Harivel
@ 2024-05-22 15:34 ` Anthony Harivel
  2024-10-16 12:17   ` Igor Mammedov
  2024-06-26 14:34 ` [PATCH v6 0/3] Add support for the RAPL MSRs series Anthony Harivel
  2024-10-16 11:52 ` Igor Mammedov
  4 siblings, 1 reply; 25+ messages in thread
From: Anthony Harivel @ 2024-05-22 15:34 UTC (permalink / raw)
  To: pbonzini, mtosatti, berrange
  Cc: qemu-devel, vchundur, rjarry, Anthony Harivel

Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL
interface (Running Average Power Limit) for advertising the accumulated
energy consumption of various power domains (e.g. CPU packages, DRAM,
etc.).

The consumption is reported via MSRs (model specific registers) like
MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are
64-bit registers that represent the accumulated energy consumption in
micro Joules. They are updated by microcode every ~1ms.

For now, KVM always returns 0 when the guest requests the value of
these MSRs. Use the KVM MSR filtering mechanism to allow QEMU to handle
these MSRs dynamically in userspace.

To limit the number of system calls per MSR access, create a new
thread in QEMU that updates the "virtual" MSR values asynchronously.

Each vCPU has its own vMSR to reflect the independence of vCPUs. The
thread updates each vMSR value with a share of the energy consumed by
the whole physical CPU package the vCPU thread runs on, in proportion
to the thread's utime and stime values.

All other non-vCPU threads are also taken into account. Their energy
consumption is evenly distributed among all vCPU threads running on
the same physical CPU package.

To overcome the problem that reading the RAPL MSRs requires privileged
access, a socket communication between QEMU and the qemu-vmsr-helper is
mandatory. You can specify the socket path with the rapl-helper-socket
parameter.

This feature is activated with
-accel kvm,rapl=true,rapl-helper-socket=/path/sock.sock

Current limitations:
- Works only on Intel host CPUs because AMD CPUs are using different MSR
  addresses.

- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at
  the moment.

Signed-off-by: Anthony Harivel <aharivel@redhat.com>
---
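As an illustration of the attribution described in the commit message
above, here is a minimal standalone sketch. It is not the code from this
patch: the helper names thread_energy() and vpkg_energy() are
hypothetical, it assumes that all vCPU threads of the virtual package ran
on the same host package during the interval, and it uses integer
round-to-nearest arithmetic where the patch uses floating point and
lround(). Unlike the patch, it does not skip vCPU threads that were idle
during the interval.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: energy attributed to one thread, i.e. the package
 * energy delta scaled by the fraction of the package tick budget the
 * thread used during the sampling interval (rounded to nearest).
 */
static uint64_t thread_energy(uint64_t e_delta, uint64_t delta_ticks,
                              uint64_t maxticks)
{
    assert(maxticks > 0);
    return (e_delta * delta_ticks + maxticks / 2) / maxticks;
}

/*
 * Hypothetical helper: energy of one virtual package for one interval.
 * Assumes every listed thread ran on the same host package.
 */
static uint64_t vpkg_energy(uint64_t e_delta, uint64_t maxticks,
                            const uint64_t *vcpu_ticks, size_t nb_vcpu,
                            const uint64_t *worker_ticks, size_t nb_worker)
{
    uint64_t worker_total = 0;
    uint64_t total = 0;
    uint64_t share;

    assert(nb_vcpu > 0);

    /* Energy spent by the non-vCPU threads of the process. */
    for (size_t i = 0; i < nb_worker; i++) {
        worker_total += thread_energy(e_delta, worker_ticks[i], maxticks);
    }

    /* That energy is split evenly among the vCPU threads. */
    share = worker_total / nb_vcpu;

    /* Each vCPU contributes its own energy plus its equal share. */
    for (size_t i = 0; i < nb_vcpu; i++) {
        total += thread_energy(e_delta, vcpu_ticks[i], maxticks) + share;
    }

    return total;
}
```

With a 4-core package at 100 ticks per second (400 ticks maximum), a
thread scheduled for 100 ticks is attributed 1/4 of the package energy
delta, which matches the example given in docs/specs/rapl-msr.rst.
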
 accel/kvm/kvm-all.c           |  27 +++
 docs/specs/index.rst          |   1 +
 docs/specs/rapl-msr.rst       | 155 ++++++++++++
 include/sysemu/kvm_int.h      |  32 +++
 target/i386/cpu.h             |   8 +
 target/i386/kvm/kvm.c         | 431 +++++++++++++++++++++++++++++++++-
 target/i386/kvm/meson.build   |   1 +
 target/i386/kvm/vmsr_energy.c | 344 +++++++++++++++++++++++++++
 target/i386/kvm/vmsr_energy.h |  99 ++++++++
 9 files changed, 1097 insertions(+), 1 deletion(-)
 create mode 100644 docs/specs/rapl-msr.rst
 create mode 100644 target/i386/kvm/vmsr_energy.c
 create mode 100644 target/i386/kvm/vmsr_energy.h

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index c0be9f5eedb8..f455e6b987b4 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3745,6 +3745,21 @@ static void kvm_set_device(Object *obj,
     s->device = g_strdup(value);
 }
 
+static void kvm_set_kvm_rapl(Object *obj, bool value, Error **errp)
+{
+    KVMState *s = KVM_STATE(obj);
+    s->msr_energy.enable = value;
+}
+
+static void kvm_set_kvm_rapl_socket_path(Object *obj,
+                                         const char *str,
+                                         Error **errp)
+{
+    KVMState *s = KVM_STATE(obj);
+    g_free(s->msr_energy.socket_path);
+    s->msr_energy.socket_path = g_strdup(str);
+}
+
 static void kvm_accel_instance_init(Object *obj)
 {
     KVMState *s = KVM_STATE(obj);
@@ -3764,6 +3779,7 @@ static void kvm_accel_instance_init(Object *obj)
     s->xen_gnttab_max_frames = 64;
     s->xen_evtchn_max_pirq = 256;
     s->device = NULL;
+    s->msr_energy.enable = false;
 }
 
 /**
@@ -3808,6 +3824,17 @@ static void kvm_accel_class_init(ObjectClass *oc, void *data)
     object_class_property_set_description(oc, "device",
         "Path to the device node to use (default: /dev/kvm)");
 
+    object_class_property_add_bool(oc, "rapl",
+                                   NULL,
+                                   kvm_set_kvm_rapl);
+    object_class_property_set_description(oc, "rapl",
+        "Allow energy related MSRs for RAPL interface in Guest");
+
+    object_class_property_add_str(oc, "rapl-helper-socket", NULL,
+                                  kvm_set_kvm_rapl_socket_path);
+    object_class_property_set_description(oc, "rapl-helper-socket",
+        "Socket Path for communicating with the Virtual MSR helper daemon");
+
     kvm_arch_accel_class_init(oc);
 }
 
diff --git a/docs/specs/index.rst b/docs/specs/index.rst
index 1484e3e76077..e738ea7d102f 100644
--- a/docs/specs/index.rst
+++ b/docs/specs/index.rst
@@ -33,3 +33,4 @@ guest hardware that is specific to QEMU.
    virt-ctlr
    vmcoreinfo
    vmgenid
+   rapl-msr
diff --git a/docs/specs/rapl-msr.rst b/docs/specs/rapl-msr.rst
new file mode 100644
index 000000000000..1202ee89bee0
--- /dev/null
+++ b/docs/specs/rapl-msr.rst
@@ -0,0 +1,155 @@
+================
+RAPL MSR support
+================
+
+The RAPL interface (Running Average Power Limit) advertises the accumulated
+energy consumption of various power domains (e.g. CPU packages, DRAM, etc.).
+
+The consumption is reported via MSRs (model specific registers) like
+MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are 64-bit
+registers that represent the accumulated energy consumption in micro Joules.
+
+Thanks to the MSR Filtering patch [#a]_ not all MSRs are handled by KVM. Some
+of them can now be handled by the userspace (QEMU). It uses a mechanism called
+"MSR filtering" where a list of MSRs is given at init time of a VM to KVM so
+that a callback is put in place. The design of this patch uses only this
+mechanism for handling the MSRs between guest/host.
+
+At the moment the following MSRs are involved:
+
+.. code:: C
+
+    #define MSR_RAPL_POWER_UNIT             0x00000606
+    #define MSR_PKG_POWER_LIMIT             0x00000610
+    #define MSR_PKG_ENERGY_STATUS           0x00000611
+    #define MSR_PKG_POWER_INFO              0x00000614
+
+The ``*_POWER_UNIT``, ``*_POWER_LIMIT`` and ``*_POWER_INFO`` MSRs are part of
+the RAPL spec. They specify the power limit of the package, provide a range of
+parameters (min power, max power, ...) and give the multiplier for the energy
+counter used to calculate the power. Those MSRs are populated once at the
+beginning by reading the host CPU MSRs and are given back to the guest 1:1
+when requested.
+
+The MSR_PKG_ENERGY_STATUS is a counter; it represents the total amount of
+energy consumed since the last time the register was cleared. If you multiply
+it by the UNIT provided above you'll get the energy in micro-joules. This
+counter only ever increases, more or less quickly depending on the
+consumption of the package. This counter is expected to overflow at some
+point.
+
+Each core belonging to the same Package reading the MSR_PKG_ENERGY_STATUS (i.e.
+"rdmsr 0x611") will retrieve the same value. The value represents the energy
+for the whole package. Whichever core reads it will get the same value, and a
+core that belongs to PKG-0 will not be able to get the value of PKG-1 and
+vice-versa.
+
+High level implementation
+-------------------------
+
+In order to update the value of the virtual MSR, a QEMU thread is created.
+The thread is basically just an infinite loop that does:
+
+1. Snapshot of the time metrics of all QEMU threads (Time spent scheduled in
+   Userspace and System)
+
+2. Snapshot of the actual MSR_PKG_ENERGY_STATUS counter of all packages where
+   the QEMU threads are running on.
+
+3. Sleep for 1 second - During this pause the vcpu and other non-vcpu threads
+   will do what they have to do and so the energy counter will increase.
+
+4. Repeat 2. and 3. and calculate the delta of every metric representing the
+   time spent scheduled for each QEMU thread *and* the energy spent by the
+   packages during the pause.
+
+5. Filter the vcpu threads and the non-vcpu threads.
+
+6. Retrieve the topology of the Virtual Machine. This helps identify which
+   vCPU is running on which virtual package.
+
+7. The total energy spent by the non-vcpu threads is divided by the number
+   of vcpu threads so that each vcpu thread will get an equal part of the
+   energy spent by the QEMU workers.
+
+8. Calculate the ratio of energy spent per vcpu threads.
+
+9. Calculate the energy for each virtual package.
+
+10. The virtual MSRs are updated for each virtual package. Each vCPU that
+    belongs to the same package will return the same value when accessing
+    the MSR.
+
+11. Loop back to 1.
+
+Ratio calculation
+-----------------
+
+In Linux, a process has an execution time associated with it. The scheduler
+divides time into clock ticks. The number of clock ticks per second can be
+found with sysconf(_SC_CLK_TCK). A typical value of clock ticks per second is
+100. So a core can run a process at the maximum of 100 ticks per second. If a
+package has 4 cores, 400 ticks maximum can be scheduled on all the cores
+of the package for a period of 1 second.
+
+The /proc/[pid]/stat [#b]_ is a procfs file that gives the execution time of a
+process with the [pid] as the process ID. It gives the number of ticks the
+process has been scheduled in userspace (utime) and kernel space (stime).
+
+By reading those metrics for a thread, one can calculate the ratio of time the
+package has spent executing the thread.
+
+Example:
+
+A 4-core package can schedule a maximum of 400 ticks per second with 100 ticks
+per second per core. If a thread was scheduled for 100 ticks within one second
+on this package, that means the thread has been scheduled for 1/4 of the whole
+package. With that, the calculation of the energy spent by the thread on this
+package during this whole second is 1/4 of the total energy spent by the
+package.
+
+Usage
+-----
+
+Currently this feature only works on an Intel CPU that has the RAPL driver
+mounted and available in the sysfs. If not, QEMU fails at start-up.
+
+This feature is activated with -accel
+kvm,rapl=true,rapl-helper-socket=/path/sock.sock
+
+It is important that the socket path is the same as the one
+:program:`qemu-vmsr-helper` is listening to.
+
+qemu-vmsr-helper
+----------------
+
+The qemu-vmsr-helper works very much like the qemu-pr-helper. Instead of
+making persistent reservations, qemu-vmsr-helper is here to work around the
+fix for CVE-2020-8694, which removed user access to the RAPL MSR attributes.
+
+A socket communication is established between QEMU processes that have the RAPL
+MSR support activated and the qemu-vmsr-helper. A systemd service and socket
+activation are provided in contrib/systemd/qemu-vmsr-helper.(service/socket).
+
+The systemd socket uses 600, like contrib/systemd/qemu-pr-helper.socket. The
+socket can be passed via SCM_RIGHTS by libvirt, or its permissions can be
+changed (e.g. 660 and root:kvm on a Debian system). Libvirt could
+also start a separate helper if needed. All in all, the policy is left to the
+user.
+
+See the qemu-pr-helper documentation or manpage for further details.
+
+Current Limitations
+-------------------
+
+- Works only on Intel host CPUs because AMD CPUs are using different MSR
+  addresses.
+
+- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at the
+  moment.
+
+References
+----------
+
+.. [#a] https://patchwork.kernel.org/project/kvm/patch/20200916202951.23760-7-graf@amazon.com/
+.. [#b] https://man7.org/linux/man-pages/man5/proc.5.html
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 3f3d13f81669..1d8fb1473bdf 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -14,6 +14,9 @@
 #include "qemu/accel.h"
 #include "qemu/queue.h"
 #include "sysemu/kvm.h"
+#include "hw/boards.h"
+#include "hw/i386/topology.h"
+#include "io/channel-socket.h"
 
 typedef struct KVMSlot
 {
@@ -50,6 +53,34 @@ typedef struct KVMMemoryListener {
 
 #define KVM_MSI_HASHTAB_SIZE    256
 
+typedef struct KVMHostTopoInfo {
+    /* Number of packages on the Host */
+    unsigned int maxpkgs;
+    /* Number of cpus on the Host */
+    unsigned int maxcpus;
+    /* Number of cpus on each different package */
+    unsigned int *pkg_cpu_count;
+    /* Each package can have different maxticks */
+    unsigned int *maxticks;
+} KVMHostTopoInfo;
+
+struct KVMMsrEnergy {
+    pid_t pid;
+    bool enable;
+    char *socket_path;
+    QIOChannelSocket *sioc;
+    QemuThread msr_thr;
+    unsigned int guest_vcpus;
+    unsigned int guest_vsockets;
+    X86CPUTopoInfo guest_topo_info;
+    KVMHostTopoInfo host_topo;
+    const CPUArchIdList *guest_cpu_list;
+    uint64_t *msr_value;
+    uint64_t msr_unit;
+    uint64_t msr_limit;
+    uint64_t msr_info;
+};
+
 enum KVMDirtyRingReaperState {
     KVM_DIRTY_RING_REAPER_NONE = 0,
     /* The reaper is sleeping */
@@ -117,6 +148,7 @@ struct KVMState
     bool kvm_dirty_ring_with_bitmap;
     uint64_t kvm_eager_split_size;  /* Eager Page Splitting chunk size */
     struct KVMDirtyRingReaper reaper;
+    struct KVMMsrEnergy msr_energy;
     NotifyVmexitOption notify_vmexit;
     uint32_t notify_window;
     uint32_t xen_version;
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index ccccb62fc353..c3891c1a6b4e 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -397,6 +397,10 @@ typedef enum X86Seg {
 #define MSR_IA32_TSX_CTRL		0x122
 #define MSR_IA32_TSCDEADLINE            0x6e0
 #define MSR_IA32_PKRS                   0x6e1
+#define MSR_RAPL_POWER_UNIT             0x00000606
+#define MSR_PKG_POWER_LIMIT             0x00000610
+#define MSR_PKG_ENERGY_STATUS           0x00000611
+#define MSR_PKG_POWER_INFO              0x00000614
 #define MSR_ARCH_LBR_CTL                0x000014ce
 #define MSR_ARCH_LBR_DEPTH              0x000014cf
 #define MSR_ARCH_LBR_FROM_0             0x00001500
@@ -1790,6 +1794,10 @@ typedef struct CPUArchState {
 
     uintptr_t retaddr;
 
+    /* RAPL MSR */
+    uint64_t msr_rapl_power_unit;
+    uint64_t msr_pkg_energy_status;
+
     /* Fields up to this point are cleared by a CPU reset */
     struct {} end_reset_fields;
 
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index c5943605ee3a..8767c8e06028 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -16,9 +16,12 @@
 #include "qapi/qapi-events-run-state.h"
 #include "qapi/error.h"
 #include "qapi/visitor.h"
+#include <math.h>
 #include <sys/ioctl.h>
 #include <sys/utsname.h>
 #include <sys/syscall.h>
+#include <sys/resource.h>
+#include <sys/time.h>
 
 #include <linux/kvm.h>
 #include "standard-headers/asm-x86/kvm_para.h"
@@ -26,6 +29,7 @@
 
 #include "cpu.h"
 #include "host-cpu.h"
+#include "vmsr_energy.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/hw_accel.h"
 #include "sysemu/kvm_int.h"
@@ -2519,7 +2523,8 @@ static int kvm_get_supported_msrs(KVMState *s)
     return ret;
 }
 
-static bool kvm_rdmsr_core_thread_count(X86CPU *cpu, uint32_t msr,
+static bool kvm_rdmsr_core_thread_count(X86CPU *cpu,
+                                        uint32_t msr,
                                         uint64_t *val)
 {
     CPUState *cs = CPU(cpu);
@@ -2530,6 +2535,53 @@ static bool kvm_rdmsr_core_thread_count(X86CPU *cpu, uint32_t msr,
     return true;
 }
 
+static bool kvm_rdmsr_rapl_power_unit(X86CPU *cpu,
+                                      uint32_t msr,
+                                      uint64_t *val)
+{
+
+    CPUState *cs = CPU(cpu);
+
+    *val = cs->kvm_state->msr_energy.msr_unit;
+
+    return true;
+}
+
+static bool kvm_rdmsr_pkg_power_limit(X86CPU *cpu,
+                                      uint32_t msr,
+                                      uint64_t *val)
+{
+
+    CPUState *cs = CPU(cpu);
+
+    *val = cs->kvm_state->msr_energy.msr_limit;
+
+    return true;
+}
+
+static bool kvm_rdmsr_pkg_power_info(X86CPU *cpu,
+                                     uint32_t msr,
+                                     uint64_t *val)
+{
+
+    CPUState *cs = CPU(cpu);
+
+    *val = cs->kvm_state->msr_energy.msr_info;
+
+    return true;
+}
+
+static bool kvm_rdmsr_pkg_energy_status(X86CPU *cpu,
+                                        uint32_t msr,
+                                        uint64_t *val)
+{
+
+    CPUState *cs = CPU(cpu);
+    *val = cs->kvm_state->msr_energy.msr_value[cs->cpu_index];
+
+    return true;
+}
+
 static Notifier smram_machine_done;
 static KVMMemoryListener smram_listener;
 static AddressSpace smram_address_space;
@@ -2564,6 +2616,340 @@ static void register_smram_listener(Notifier *n, void *unused)
                                  &smram_address_space, 1, "kvm-smram");
 }
 
+static void *kvm_msr_energy_thread(void *data)
+{
+    KVMState *s = data;
+    struct KVMMsrEnergy *vmsr = &s->msr_energy;
+
+    g_autofree vmsr_package_energy_stat *pkg_stat = NULL;
+    g_autofree vmsr_thread_stat *thd_stat = NULL;
+    g_autofree CPUState *cpu = NULL;
+    g_autofree unsigned int *vpkgs_energy_stat = NULL;
+    unsigned int num_threads = 0;
+
+    X86CPUTopoIDs topo_ids;
+
+    rcu_register_thread();
+
+    /* Allocate memory for each package energy status */
+    pkg_stat = g_new0(vmsr_package_energy_stat, vmsr->host_topo.maxpkgs);
+
+    /* Allocate memory for thread stats */
+    thd_stat = g_new0(vmsr_thread_stat, 1);
+
+    /* Allocate memory for holding virtual package energy counter */
+    vpkgs_energy_stat = g_new0(unsigned int, vmsr->guest_vsockets);
+
+    /* Populate the max ticks of each package */
+    for (int i = 0; i < vmsr->host_topo.maxpkgs; i++) {
+        /*
+         * Max numbers of ticks per package
+         * Time in second * Number of ticks/second * Number of cores/package
+         * ex: 100 ticks/second/CPU, 12 CPUs per Package gives 1200 ticks max
+         */
+        vmsr->host_topo.maxticks[i] = (MSR_ENERGY_THREAD_SLEEP_US / 1000000)
+                        * sysconf(_SC_CLK_TCK)
+                        * vmsr->host_topo.pkg_cpu_count[i];
+    }
+
+    while (true) {
+        /* Get all qemu threads id */
+        g_autofree pid_t *thread_ids =
+            vmsr_get_thread_ids(vmsr->pid, &num_threads);
+
+        if (thread_ids == NULL) {
+            goto clean;
+        }
+
+        thd_stat = g_renew(vmsr_thread_stat, thd_stat, num_threads);
+        /* Unlike g_new0, g_renew0 function doesn't exist yet... */
+        memset(thd_stat, 0, num_threads * sizeof(vmsr_thread_stat));
+
+        /* Populate all the thread stats */
+        for (int i = 0; i < num_threads; i++) {
+            thd_stat[i].utime = g_new0(unsigned long long, 2);
+            thd_stat[i].stime = g_new0(unsigned long long, 2);
+            thd_stat[i].thread_id = thread_ids[i];
+            vmsr_read_thread_stat(vmsr->pid,
+                                  thd_stat[i].thread_id,
+                                  thd_stat[i].utime,
+                                  thd_stat[i].stime,
+                                  &thd_stat[i].cpu_id);
+            thd_stat[i].pkg_id =
+                vmsr_get_physical_package_id(thd_stat[i].cpu_id);
+        }
+
+        /* Retrieve all packages power plane energy counter */
+        for (int i = 0; i < vmsr->host_topo.maxpkgs; i++) {
+            for (int j = 0; j < num_threads; j++) {
+                /*
+                 * Use the first thread we found that ran on the CPU
+                 * of the package to read the packages energy counter
+                 */
+                if (thd_stat[j].pkg_id == i) {
+                    pkg_stat[i].e_start =
+                    vmsr_read_msr(MSR_PKG_ENERGY_STATUS,
+                                  thd_stat[j].cpu_id,
+                                  thd_stat[j].thread_id,
+                                  s->msr_energy.sioc);
+                    break;
+                }
+            }
+        }
+
+        /* Sleep a short period while the other threads are working */
+        usleep(MSR_ENERGY_THREAD_SLEEP_US);
+
+        /*
+         * Retrieve all packages power plane energy counter
+         * Calculate the delta of all packages
+         */
+        for (int i = 0; i < vmsr->host_topo.maxpkgs; i++) {
+            for (int j = 0; j < num_threads; j++) {
+                /*
+                 * Use the first thread we found that ran on the CPU
+                 * of the package to read the packages energy counter
+                 */
+                if (thd_stat[j].pkg_id == i) {
+                    pkg_stat[i].e_end =
+                    vmsr_read_msr(MSR_PKG_ENERGY_STATUS,
+                                  thd_stat[j].cpu_id,
+                                  thd_stat[j].thread_id,
+                                  s->msr_energy.sioc);
+                    /*
+                     * Guard against the case where the VM was
+                     * migrated during the sleep period, or any
+                     * other case where the energy counter might
+                     * be lower after the sleep period.
+                     */
+                    if (pkg_stat[i].e_end > pkg_stat[i].e_start) {
+                        pkg_stat[i].e_delta =
+                            pkg_stat[i].e_end - pkg_stat[i].e_start;
+                    } else {
+                        pkg_stat[i].e_delta = 0;
+                    }
+                    break;
+                }
+            }
+        }
+
+        /* Delta of ticks spent by each thread between the samples */
+        for (int i = 0; i < num_threads; i++) {
+            vmsr_read_thread_stat(vmsr->pid,
+                                  thd_stat[i].thread_id,
+                                  thd_stat[i].utime,
+                                  thd_stat[i].stime,
+                                  &thd_stat[i].cpu_id);
+
+            if (vmsr->pid < 0) {
+                /*
+                 * We don't count the dead thread
+                 * i.e threads that existed before the sleep
+                 * and not anymore
+                 */
+                thd_stat[i].delta_ticks = 0;
+            } else {
+                vmsr_delta_ticks(thd_stat, i);
+            }
+        }
+
+        /*
+         * Identify the vcpu threads
+         * Calculate the number of vcpu per package
+         */
+        CPU_FOREACH(cpu) {
+            for (int i = 0; i < num_threads; i++) {
+                if (cpu->thread_id == thd_stat[i].thread_id) {
+                    thd_stat[i].is_vcpu = true;
+                    thd_stat[i].vcpu_id = cpu->cpu_index;
+                    pkg_stat[thd_stat[i].pkg_id].nb_vcpu++;
+                    thd_stat[i].acpi_id = kvm_arch_vcpu_id(cpu);
+                    break;
+                }
+            }
+        }
+
+        /* Retrieve the virtual package number of each vCPU */
+        for (int i = 0; i < vmsr->guest_cpu_list->len; i++) {
+            for (int j = 0; j < num_threads; j++) {
+                if ((thd_stat[j].acpi_id ==
+                        vmsr->guest_cpu_list->cpus[i].arch_id)
+                    && (thd_stat[j].is_vcpu == true)) {
+                    x86_topo_ids_from_apicid(thd_stat[j].acpi_id,
+                        &vmsr->guest_topo_info, &topo_ids);
+                    thd_stat[j].vpkg_id = topo_ids.pkg_id;
+                }
+            }
+        }
+
+        /* Calculate the total energy of all non-vCPU thread */
+        for (int i = 0; i < num_threads; i++) {
+            if ((thd_stat[i].is_vcpu != true) &&
+                (thd_stat[i].delta_ticks > 0)) {
+                double temp;
+                temp = vmsr_get_ratio(pkg_stat[thd_stat[i].pkg_id].e_delta,
+                    thd_stat[i].delta_ticks,
+                    vmsr->host_topo.maxticks[thd_stat[i].pkg_id]);
+                pkg_stat[thd_stat[i].pkg_id].e_ratio
+                    += (uint64_t)lround(temp);
+            }
+        }
+
+        /* Calculate the ratio per non-vCPU thread of each package */
+        for (int i = 0; i < vmsr->host_topo.maxpkgs; i++) {
+            if (pkg_stat[i].nb_vcpu > 0) {
+                pkg_stat[i].e_ratio = pkg_stat[i].e_ratio / pkg_stat[i].nb_vcpu;
+            }
+        }
+
+        /*
+         * Calculate the energy for each Package:
+         * Energy Package = sum of each vCPU energy that belongs to the package
+         */
+        for (int i = 0; i < num_threads; i++) {
+            if ((thd_stat[i].is_vcpu == true) &&
+                    (thd_stat[i].delta_ticks > 0)) {
+                double temp;
+                temp = vmsr_get_ratio(pkg_stat[thd_stat[i].pkg_id].e_delta,
+                    thd_stat[i].delta_ticks,
+                    vmsr->host_topo.maxticks[thd_stat[i].pkg_id]);
+                vpkgs_energy_stat[thd_stat[i].vpkg_id] +=
+                    (uint64_t)lround(temp);
+                vpkgs_energy_stat[thd_stat[i].vpkg_id] +=
+                    pkg_stat[thd_stat[i].pkg_id].e_ratio;
+            }
+        }
+
+        /*
+         * Finally populate the vmsr register of each vCPU with the total
+         * package value to emulate the real hardware where each CPU returns
+         * the value of the package it belongs to.
+         */
+        for (int i = 0; i < num_threads; i++) {
+            if ((thd_stat[i].is_vcpu == true) &&
+                    (thd_stat[i].delta_ticks > 0)) {
+                vmsr->msr_value[thd_stat[i].vcpu_id] =
+                                        vpkgs_energy_stat[thd_stat[i].vpkg_id];
+            }
+        }
+
+        /* Freeing memory before zeroing the pointer */
+        for (int i = 0; i < num_threads; i++) {
+            g_free(thd_stat[i].utime);
+            g_free(thd_stat[i].stime);
+        }
+    }
+
+clean:
+    rcu_unregister_thread();
+    return NULL;
+}
+
+static int kvm_msr_energy_thread_init(KVMState *s, MachineState *ms)
+{
+    MachineClass *mc = MACHINE_GET_CLASS(ms);
+    struct KVMMsrEnergy *r = &s->msr_energy;
+    int ret = 0;
+
+    /*
+     * Sanity check
+     * 1. Host cpu must be Intel cpu
+     * 2. RAPL must be enabled on the Host
+     */
+    if (is_host_cpu_intel()) {
+        error_report("The RAPL feature can only be enabled on hosts "
+                     "with Intel CPU models");
+        ret = 1;
+        goto out;
+    }
+
+    if (!is_rapl_enabled()) {
+        ret = 1;
+        goto out;
+    }
+
+    /* Retrieve the virtual topology */
+    vmsr_init_topo_info(&r->guest_topo_info, ms);
+
+    /* Retrieve the number of vcpu */
+    r->guest_vcpus = ms->smp.cpus;
+
+    /* Retrieve the number of virtual sockets */
+    r->guest_vsockets = ms->smp.sockets;
+
+    /* Allocate register memory (MSR_PKG_ENERGY_STATUS) for each vcpu */
+    r->msr_value = g_new0(uint64_t, r->guest_vcpus);
+
+    /* Retrieve the CPUArchIDlist */
+    r->guest_cpu_list = mc->possible_cpu_arch_ids(ms);
+
+    /* Max number of cpus on the Host */
+    r->host_topo.maxcpus = vmsr_get_maxcpus();
+    if (r->host_topo.maxcpus == 0) {
+        error_report("host max cpus = 0");
+        ret = 1;
+        goto out;
+    }
+
+    /* Max number of packages on the host */
+    r->host_topo.maxpkgs = vmsr_get_max_physical_package(r->host_topo.maxcpus);
+    if (r->host_topo.maxpkgs == 0) {
+        error_report("host max pkgs = 0");
+        ret = 1;
+        goto out;
+    }
+
+    /* Allocate memory for each package on the host */
+    r->host_topo.pkg_cpu_count = g_new0(unsigned int, r->host_topo.maxpkgs);
+    r->host_topo.maxticks = g_new0(unsigned int, r->host_topo.maxpkgs);
+
+    vmsr_count_cpus_per_package(r->host_topo.pkg_cpu_count,
+                                r->host_topo.maxpkgs);
+    for (int i = 0; i < r->host_topo.maxpkgs; i++) {
+        if (r->host_topo.pkg_cpu_count[i] == 0) {
+            error_report("cpu per packages = 0 on package_%d", i);
+            ret = 1;
+            goto out;
+        }
+    }
+
+    /* Get QEMU PID */
+    r->pid = getpid();
+
+    /* Compute the socket path if necessary */
+    if (s->msr_energy.socket_path == NULL) {
+        s->msr_energy.socket_path = vmsr_compute_default_paths();
+    }
+
+    /* Open socket with vmsr helper */
+    s->msr_energy.sioc = vmsr_open_socket(s->msr_energy.socket_path);
+
+    if (s->msr_energy.sioc == NULL) {
+        error_report("vmsr socket opening failed");
+        ret = 1;
+        goto out;
+    }
+
+    /* Those MSR values should not change */
+    r->msr_unit  = vmsr_read_msr(MSR_RAPL_POWER_UNIT, 0, r->pid,
+                                    s->msr_energy.sioc);
+    r->msr_limit = vmsr_read_msr(MSR_PKG_POWER_LIMIT, 0, r->pid,
+                                    s->msr_energy.sioc);
+    r->msr_info  = vmsr_read_msr(MSR_PKG_POWER_INFO, 0, r->pid,
+                                    s->msr_energy.sioc);
+    if (r->msr_unit == 0 || r->msr_limit == 0 || r->msr_info == 0) {
+        error_report("can't read any virtual msr");
+        ret = 1;
+        goto out;
+    }
+
+    qemu_thread_create(&r->msr_thr, "kvm-msr",
+                       kvm_msr_energy_thread,
+                       s, QEMU_THREAD_JOINABLE);
+out:
+    return ret;
+}
+
 int kvm_arch_get_default_type(MachineState *ms)
 {
     return 0;
@@ -2768,6 +3154,49 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
                          strerror(-ret));
             exit(1);
         }
+
+        if (s->msr_energy.enable == true) {
+            r = kvm_filter_msr(s, MSR_RAPL_POWER_UNIT,
+                               kvm_rdmsr_rapl_power_unit, NULL);
+            if (!r) {
+                error_report("Could not install MSR_RAPL_POWER_UNIT "
+                             "handler: %s",
+                             strerror(-ret));
+                exit(1);
+            }
+
+            r = kvm_filter_msr(s, MSR_PKG_POWER_LIMIT,
+                               kvm_rdmsr_pkg_power_limit, NULL);
+            if (!r) {
+                error_report("Could not install MSR_PKG_POWER_LIMIT "
+                             "handler: %s",
+                             strerror(-ret));
+                exit(1);
+            }
+
+            r = kvm_filter_msr(s, MSR_PKG_POWER_INFO,
+                               kvm_rdmsr_pkg_power_info, NULL);
+            if (!r) {
+                error_report("Could not install MSR_PKG_POWER_INFO "
+                             "handler: %s",
+                             strerror(-ret));
+                exit(1);
+            }
+            r = kvm_filter_msr(s, MSR_PKG_ENERGY_STATUS,
+                               kvm_rdmsr_pkg_energy_status, NULL);
+            if (!r) {
+                error_report("Could not install MSR_PKG_ENERGY_STATUS "
+                             "handler: %s",
+                             strerror(-ret));
+                exit(1);
+            }
+            r = kvm_msr_energy_thread_init(s, ms);
+            if (r) {
+                error_report("kvm : error RAPL feature requirement not meet");
+                exit(1);
+            }
+
+        }
     }
 
     return 0;
diff --git a/target/i386/kvm/meson.build b/target/i386/kvm/meson.build
index e7850981e62d..3996cafaf29f 100644
--- a/target/i386/kvm/meson.build
+++ b/target/i386/kvm/meson.build
@@ -3,6 +3,7 @@ i386_kvm_ss = ss.source_set()
 i386_kvm_ss.add(files(
   'kvm.c',
   'kvm-cpu.c',
+  'vmsr_energy.c',
 ))
 
 i386_kvm_ss.add(when: 'CONFIG_XEN_EMU', if_true: files('xen-emu.c'))
diff --git a/target/i386/kvm/vmsr_energy.c b/target/i386/kvm/vmsr_energy.c
new file mode 100644
index 000000000000..acf0fc0a2fb3
--- /dev/null
+++ b/target/i386/kvm/vmsr_energy.c
@@ -0,0 +1,344 @@
+/*
+ * QEMU KVM support -- x86 virtual RAPL msr
+ *
+ * Copyright 2024 Red Hat, Inc. 2024
+ *
+ *  Author:
+ *      Anthony Harivel <aharivel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+#include "vmsr_energy.h"
+#include "io/channel.h"
+#include "io/channel-socket.h"
+#include "hw/boards.h"
+#include "cpu.h"
+#include "host-cpu.h"
+
+char *vmsr_compute_default_paths(void)
+{
+    g_autofree char *state = qemu_get_local_state_dir();
+
+    return g_build_filename(state, "run", "qemu-vmsr-helper.sock", NULL);
+}
+
+bool is_host_cpu_intel(void)
+{
+    int family, model, stepping;
+    char vendor[CPUID_VENDOR_SZ + 1];
+
+    host_cpu_vendor_fms(vendor, &family, &model, &stepping);
+
+    return strcmp(vendor, CPUID_VENDOR_INTEL);
+}
+
+int is_rapl_enabled(void)
+{
+    const char *path = "/sys/class/powercap/intel-rapl/enabled";
+    FILE *file = fopen(path, "r");
+    int value = 0;
+
+    if (file != NULL) {
+        if (fscanf(file, "%d", &value) != 1) {
+            error_report("INTEL RAPL not enabled");
+        }
+        fclose(file);
+    } else {
+        error_report("Error opening %s", path);
+    }
+
+    return value;
+}
+
+QIOChannelSocket *vmsr_open_socket(const char *path)
+{
+    g_autofree char *socket_path = NULL;
+
+    socket_path = g_strdup(path);
+
+    SocketAddress saddr = {
+        .type = SOCKET_ADDRESS_TYPE_UNIX,
+        .u.q_unix.path = socket_path
+    };
+
+    QIOChannelSocket *sioc = qio_channel_socket_new();
+    Error *local_err = NULL;
+
+    qio_channel_set_name(QIO_CHANNEL(sioc), "vmsr-helper");
+    qio_channel_socket_connect_sync(sioc,
+                                    &saddr,
+                                    &local_err);
+    if (local_err) {
+        /* Close socket. */
+        qio_channel_close(QIO_CHANNEL(sioc), NULL);
+        object_unref(OBJECT(sioc));
+        sioc = NULL;
+        goto out;
+    }
+
+    qio_channel_set_delay(QIO_CHANNEL(sioc), false);
+out:
+    return sioc;
+}
+
+uint64_t vmsr_read_msr(uint32_t reg, uint32_t cpu_id, uint32_t tid,
+                       QIOChannelSocket *sioc)
+{
+    uint64_t data = 0;
+    int r = 0;
+    Error *local_err = NULL;
+    uint32_t buffer[3];
+    /*
+     * Send the required arguments:
+     * 1. RAPL MSR register to read
+     * 2. On which CPU ID
+     * 3. From which vCPU (Thread ID)
+     */
+    buffer[0] = reg;
+    buffer[1] = cpu_id;
+    buffer[2] = tid;
+
+    r = qio_channel_write_all(QIO_CHANNEL(sioc),
+                              (char *)buffer, sizeof(buffer),
+                              &local_err);
+    if (r < 0) {
+        goto out_close;
+    }
+
+    r = qio_channel_read_all(QIO_CHANNEL(sioc),
+                             (char *)&data, sizeof(data),
+                             &local_err);
+    if (r < 0) {
+        data = 0;
+        goto out_close;
+    }
+
+out_close:
+    return data;
+}
+
+/* Retrieve the max number of physical packages */
+unsigned int vmsr_get_max_physical_package(unsigned int max_cpus)
+{
+    const char *dir = "/sys/devices/system/cpu/";
+    const char *topo_path = "topology/physical_package_id";
+    g_autofree int *uniquePackages = g_new0(int, max_cpus);
+    unsigned int packageCount = 0;
+    FILE *file = NULL;
+
+    for (int i = 0; i < max_cpus; i++) {
+        g_autofree char *filePath = NULL;
+        g_autofree char *cpuid = g_strdup_printf("cpu%d", i);
+
+        filePath = g_build_filename(dir, cpuid, topo_path, NULL);
+
+        file = fopen(filePath, "r");
+
+        if (file == NULL) {
+            error_report("Error opening physical_package_id file");
+            return 0;
+        }
+
+        char packageId[10] = "";
+        if (fgets(packageId, sizeof(packageId), file) == NULL) {
+            packageCount = 0;
+        }
+
+        fclose(file);
+
+        int currentPackageId = atoi(packageId);
+
+        bool isUnique = true;
+        for (int j = 0; j < packageCount; j++) {
+            if (uniquePackages[j] == currentPackageId) {
+                isUnique = false;
+                break;
+            }
+        }
+
+        if (isUnique) {
+            uniquePackages[packageCount] = currentPackageId;
+            packageCount++;
+
+            if (packageCount >= max_cpus) {
+                break;
+            }
+        }
+    }
+
+    return (packageCount == 0) ? 1 : packageCount;
+}
+
+/* Retrieve the max number of physical cpu on the host */
+unsigned int vmsr_get_maxcpus(void)
+{
+    GDir *dir;
+    const gchar *entry_name;
+    unsigned int cpu_count = 0;
+    const char *path = "/sys/devices/system/cpu/";
+
+    dir = g_dir_open(path, 0, NULL);
+    if (dir == NULL) {
+        error_report("Unable to open cpu directory");
+        return -1;
+    }
+
+    while ((entry_name = g_dir_read_name(dir)) != NULL) {
+        if (g_ascii_strncasecmp(entry_name, "cpu", 3) == 0 &&
+            isdigit(entry_name[3])) {
+            cpu_count++;
+        }
+    }
+
+    g_dir_close(dir);
+
+    return cpu_count;
+}
+
+/* Count the number of physical cpus in each package */
+unsigned int vmsr_count_cpus_per_package(unsigned int *package_count,
+                                         unsigned int max_pkgs)
+{
+    gsize length;
+
+    /* Iterate over cpus and count cpus in each package */
+    for (int cpu_id = 0; ; cpu_id++) {
+        /* Scoped to the loop body so each iteration frees its strings */
+        g_autofree char *file_contents = NULL;
+        g_autofree char *path = NULL;
+        g_autofree char *path_name = NULL;
+
+        path_name = g_strdup_printf("/sys/devices/system/cpu/cpu%d/"
+            "topology/physical_package_id", cpu_id);
+        path = g_build_filename(path_name, NULL);
+        if (!g_file_get_contents(path, &file_contents, &length, NULL)) {
+            break; /* No more cpus */
+        }
+
+        /* Get the physical package ID for this CPU */
+        int package_id = atoi(file_contents);
+
+        /* Check if the package ID is within the known number of packages */
+        if (package_id >= 0 && package_id < max_pkgs) {
+            /* If yes, count the cpu for this package*/
+            package_count[package_id]++;
+        }
+    }
+
+    return 0;
+}
+
+/* Get the physical package id from a given cpu id */
+int vmsr_get_physical_package_id(int cpu_id)
+{
+    g_autofree char *file_contents = NULL;
+    g_autofree char *file_path = NULL;
+    int package_id = -1;
+    gsize length;
+
+    file_path = g_strdup_printf("/sys/devices/system/cpu/cpu%d"
+        "/topology/physical_package_id", cpu_id);
+
+    if (!g_file_get_contents(file_path, &file_contents, &length, NULL)) {
+        goto out;
+    }
+
+    package_id = atoi(file_contents);
+
+out:
+    return package_id;
+}
+
+/* Read the scheduled time for a given thread of a given pid */
+void vmsr_read_thread_stat(pid_t pid,
+                      unsigned int thread_id,
+                      unsigned long long *utime,
+                      unsigned long long *stime,
+                      unsigned int *cpu_id)
+{
+    g_autofree char *path = NULL;
+    g_autofree char *path_name = NULL;
+
+    path_name = g_strdup_printf("/proc/%d/task/%u/stat", pid, thread_id);
+
+    path = g_build_filename(path_name, NULL);
+
+    FILE *file = fopen(path, "r");
+    if (file == NULL) {
+        pid = -1;
+        return;
+    }
+
+    if (fscanf(file, "%*d (%*[^)]) %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u"
+        " %llu %llu %*d %*d %*d %*d %*d %*d %*u %*u %*d %*u %*u"
+        " %*u %*u %*u %*u %*u %*u %*u %*u %*u %*d %*u %*u %u",
+           utime, stime, cpu_id) != 3)
+    {
+        fclose(file);
+        return;
+    }
+
+    fclose(file);
+    return;
+}
+
+/* Read QEMU stat task folder to retrieve all QEMU threads ID */
+pid_t *vmsr_get_thread_ids(pid_t pid, unsigned int *num_threads)
+{
+    g_autofree char *task_path = g_strdup_printf("%d/task", pid);
+    g_autofree char *path = g_build_filename("/proc", task_path, NULL);
+
+    DIR *dir = opendir(path);
+    if (dir == NULL) {
+        error_report("Error opening /proc/qemu/task");
+        return NULL;
+    }
+
+    pid_t *thread_ids = NULL;
+    unsigned int thread_count = 0;
+
+    struct dirent *ent = NULL;
+    while ((ent = readdir(dir)) != NULL) {
+        if (ent->d_name[0] == '.') {
+            continue;
+        }
+        pid_t tid = atoi(ent->d_name);
+        if (pid != tid) {
+            thread_ids = g_renew(pid_t, thread_ids, (thread_count + 1));
+            thread_ids[thread_count] = tid;
+            thread_count++;
+        }
+    }
+
+    closedir(dir);
+
+    *num_threads = thread_count;
+    return thread_ids;
+}
+
+void vmsr_delta_ticks(vmsr_thread_stat *thd_stat, int i)
+{
+    thd_stat[i].delta_ticks = (thd_stat[i].utime[1] + thd_stat[i].stime[1])
+                            - (thd_stat[i].utime[0] + thd_stat[i].stime[0]);
+}
+
+double vmsr_get_ratio(uint64_t e_delta,
+                      unsigned long long delta_ticks,
+                      unsigned int maxticks)
+{
+    return (e_delta / 100.0) * ((100.0 / maxticks) * delta_ticks);
+}
+
+void vmsr_init_topo_info(X86CPUTopoInfo *topo_info,
+                           const MachineState *ms)
+{
+    topo_info->dies_per_pkg = ms->smp.dies;
+    topo_info->cores_per_die = ms->smp.cores;
+    topo_info->threads_per_core = ms->smp.threads;
+}
+
diff --git a/target/i386/kvm/vmsr_energy.h b/target/i386/kvm/vmsr_energy.h
new file mode 100644
index 000000000000..16cc1f4814f6
--- /dev/null
+++ b/target/i386/kvm/vmsr_energy.h
@@ -0,0 +1,99 @@
+/*
+ * QEMU KVM support -- x86 virtual energy-related MSR.
+ *
+ * Copyright 2024 Red Hat, Inc. 2024
+ *
+ *  Author:
+ *      Anthony Harivel <aharivel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef VMSR_ENERGY_H
+#define VMSR_ENERGY_H
+
+#include <stdint.h>
+#include "qemu/osdep.h"
+#include "io/channel-socket.h"
+#include "hw/i386/topology.h"
+
+/*
+ * Define the interval time in microseconds between 2 samples of
+ * energy related MSRs
+ */
+#define MSR_ENERGY_THREAD_SLEEP_US 1000000.0
+
+/*
+ * Thread statistic
+ * @ thread_id: TID (thread ID)
+ * @ is_vcpu: true if TID is vCPU thread
+ * @ cpu_id: CPU number last executed on
+ * @ pkg_id: package number of the CPU
+ * @ vcpu_id: vCPU ID
+ * @ vpkg_id: virtual package number
+ * @ acpi_id: APIC id of the vCPU
+ * @ utime: amount of clock ticks the thread
+ *          has been scheduled in User mode
+ * @ stime: amount of clock ticks the thread
+ *          has been scheduled in System mode
+ * @ delta_ticks: delta of utime+stime between
+ *          the two samples (before/after sleep)
+ */
+struct vmsr_thread_stat {
+    unsigned int thread_id;
+    bool is_vcpu;
+    unsigned int cpu_id;
+    unsigned int pkg_id;
+    unsigned int vpkg_id;
+    unsigned int vcpu_id;
+    unsigned long acpi_id;
+    unsigned long long *utime;
+    unsigned long long *stime;
+    unsigned long long delta_ticks;
+};
+
+/*
+ * Package statistic
+ * @ e_start: package energy counter before the sleep
+ * @ e_end: package energy counter after the sleep
+ * @ e_delta: delta of package energy counter
+ * @ e_ratio: store the energy ratio of non-vCPU thread
+ * @ nb_vcpu: number of vCPU running on this package
+ */
+struct vmsr_package_energy_stat {
+    uint64_t e_start;
+    uint64_t e_end;
+    uint64_t e_delta;
+    uint64_t e_ratio;
+    unsigned int nb_vcpu;
+};
+
+typedef struct vmsr_thread_stat vmsr_thread_stat;
+typedef struct vmsr_package_energy_stat vmsr_package_energy_stat;
+
+char *vmsr_compute_default_paths(void);
+void vmsr_read_thread_stat(pid_t pid,
+                      unsigned int thread_id,
+                      unsigned long long *utime,
+                      unsigned long long *stime,
+                      unsigned int *cpu_id);
+
+QIOChannelSocket *vmsr_open_socket(const char *path);
+uint64_t vmsr_read_msr(uint32_t reg, uint32_t cpu_id,
+                       uint32_t tid, QIOChannelSocket *sioc);
+void vmsr_delta_ticks(vmsr_thread_stat *thd_stat, int i);
+unsigned int vmsr_get_maxcpus(void);
+unsigned int vmsr_get_max_physical_package(unsigned int max_cpus);
+unsigned int vmsr_count_cpus_per_package(unsigned int *package_count,
+                                         unsigned int max_pkgs);
+int vmsr_get_physical_package_id(int cpu_id);
+pid_t *vmsr_get_thread_ids(pid_t pid, unsigned int *num_threads);
+double vmsr_get_ratio(uint64_t e_delta,
+                      unsigned long long delta_ticks,
+                      unsigned int maxticks);
+void vmsr_init_topo_info(X86CPUTopoInfo *topo_info, const MachineState *ms);
+bool is_host_cpu_intel(void);
+int is_rapl_enabled(void);
+#endif /* VMSR_ENERGY_H */
-- 
2.45.1



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v6 0/3] Add support for the RAPL MSRs series
  2024-05-22 15:34 [PATCH v6 0/3] Add support for the RAPL MSRs series Anthony Harivel
                   ` (2 preceding siblings ...)
  2024-05-22 15:34 ` [PATCH v6 3/3] Add support for RAPL MSRs in KVM/Qemu Anthony Harivel
@ 2024-06-26 14:34 ` Anthony Harivel
  2024-10-16 11:52 ` Igor Mammedov
  4 siblings, 0 replies; 25+ messages in thread
From: Anthony Harivel @ 2024-06-26 14:34 UTC (permalink / raw)
  To: pbonzini, mtosatti, berrange; +Cc: qemu-devel, vchundur, rjarry


Just a gentle ping for the above patch series.







^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6 0/3] Add support for the RAPL MSRs series
  2024-05-22 15:34 [PATCH v6 0/3] Add support for the RAPL MSRs series Anthony Harivel
                   ` (3 preceding siblings ...)
  2024-06-26 14:34 ` [PATCH v6 0/3] Add support for the RAPL MSRs series Anthony Harivel
@ 2024-10-16 11:52 ` Igor Mammedov
  2024-10-16 12:56   ` Anthony Harivel
  4 siblings, 1 reply; 25+ messages in thread
From: Igor Mammedov @ 2024-10-16 11:52 UTC (permalink / raw)
  To: Anthony Harivel
  Cc: pbonzini, mtosatti, berrange, qemu-devel, vchundur, rjarry

On Wed, 22 May 2024 17:34:49 +0200
Anthony Harivel <aharivel@redhat.com> wrote:

> Dear maintainers, 
> 
> First of all, thank you very much for your review of my patch 
> [1].

I've tried to play with this feature and have a few questions about it

 1. trying to start with an inaccessible or nonexistent socket
        -accel kvm,rapl=on,rapl-helper-socket=/tmp/socket 
    I get:
      qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/socks: vmsr socket opening failed
      qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/socks: kvm : error RAPL feature requirement not met
    * is it possible to report the actual OS error that happened during open/connect,
      instead of the unhelpful 'socket opening failed'?

      From what I see, vmsr_open_socket() ignores the error,
      and btw leaks it as well

    * the 2nd line shouldn't be there if the 1st error is already present.

 2.  getting periodic errors on the console where QEMU has been started
      # ./qemu-vmsr-helper -k /tmp/sock
     ./qemu-system-x86_64 -snapshot -m 4G -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock rhel90.img  -vnc :0 -cpu host
     and let it run

      it appears rdmsr works (well, it returns some values at least)
      however there are recurring errors in qemu's stderr(or out)
      
      qemu-system-x86_64: Error opening /proc/2496093/task/2496109/stat
      qemu-system-x86_64: Error opening /proc/2496093/task/2496095/stat

      My guess is that these are temporary threads that come and go, but
      they still shouldn't cause errors if it's normal operation.

      Also on the daemon side, a few times I got this while the guest was running:
        qemu-vmsr-helper: Failed to open /proc at /proc/2496026/task/2496044
        qemu-vmsr-helper: Requested TID not in peer PID: 2496026 2496044
      though I can't reproduce it reliably

 3. when starting daemon not as root, it starts 'fine' but later on complains
      qemu-vmsr-helper: Failed to open MSR file at /dev/cpu/0/msr
    perhaps it would be better to fail at daemon start if it doesn't have
    access to the necessary files.

 4. in case #3, guest also fails to start with errors:
      qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock: can't read any virtual msr
      qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock: kvm : error RAPL feature requirement not met
     again line #2 is not useful and probably not needed (maybe make it a tracepoint)
     and #1 is unhelpful - it would be better if it directed the user to check qemu-vmsr-helper

 5. does AMD have similar MSRs that we could use to make this feature complete?

 6. What happens to power accounting if host constantly migrates
    vcpus between sockets, are values we are getting still correct/meaningful?
    Or do we need to pin vcpus to get 'accurate' values?

 7. do we have to have a dedicated thread for polling data from the daemon?

    Can we fetch data from the vcpu thread that has accessed the msr
    (with some caching and rate limiting access to the daemon)?
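
The caching and rate limiting raised in question 7 could look roughly like
the sketch below. It is a standalone illustration under stated
assumptions, not code from the patch: struct msr_cache, msr_cache_read()
and counting_fetch() are hypothetical names, the caller supplies the
current time in microseconds, and the fetch callback stands in for a
round trip to the helper daemon.

```c
#include <stdbool.h>
#include <stdint.h>

/* Callback that performs the expensive read (e.g. asks the daemon). */
typedef uint64_t (*msr_fetch_fn)(void *opaque);

/* Hypothetical per-MSR cache entry. */
struct msr_cache {
    uint64_t value;           /* last value obtained from fetch() */
    uint64_t last_fetch_us;   /* time of the last fetch */
    uint64_t min_interval_us; /* minimum delay between two fetches */
    bool valid;               /* false until the first fetch */
};

/*
 * Return the cached value, refreshing it only when the entry is empty
 * or at least min_interval_us has elapsed since the last fetch.
 */
static uint64_t msr_cache_read(struct msr_cache *c, uint64_t now_us,
                               msr_fetch_fn fetch, void *opaque)
{
    if (!c->valid || now_us - c->last_fetch_us >= c->min_interval_us) {
        c->value = fetch(opaque);
        c->last_fetch_us = now_us;
        c->valid = true;
    }
    return c->value;
}

/* Stand-in fetch for experiments: counts calls and returns the count. */
static uint64_t counting_fetch(void *opaque)
{
    unsigned int *calls = opaque;

    return ++*calls;
}
```

With a 1000 us interval, reads at t=0 and t=500 trigger a single fetch,
and a read at t=1000 triggers a second one. A real implementation would
also need locking if several vCPU threads share one entry.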




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6 3/3] Add support for RAPL MSRs in KVM/Qemu
  2024-05-22 15:34 ` [PATCH v6 3/3] Add support for RAPL MSRs in KVM/Qemu Anthony Harivel
@ 2024-10-16 12:17   ` Igor Mammedov
  2024-10-16 13:04     ` Anthony Harivel
  0 siblings, 1 reply; 25+ messages in thread
From: Igor Mammedov @ 2024-10-16 12:17 UTC (permalink / raw)
  To: Anthony Harivel
  Cc: pbonzini, mtosatti, berrange, qemu-devel, vchundur, rjarry

On Wed, 22 May 2024 17:34:52 +0200
Anthony Harivel <aharivel@redhat.com> wrote:

> Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL
> interface (Running Average Power Limit) for advertising the accumulated
> energy consumption of various power domains (e.g. CPU packages, DRAM,
> etc.).
> 
> The consumption is reported via MSRs (model specific registers) like
> MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are
> 64-bit registers that represent the accumulated energy consumption in
> micro Joules. They are updated by microcode every ~1ms.
> 
> For now, KVM always returns 0 when the guest requests the value of
> these MSRs. Use the KVM MSR filtering mechanism to allow QEMU to handle
> these MSRs dynamically in userspace.
> 
> To limit the number of system calls for every MSR call, create a new
> thread in QEMU that updates the "virtual" MSR values asynchronously.
> 
> Each vCPU has its own vMSR to reflect the independence of vCPUs. The
> thread updates the vMSR values with the ratio of energy consumed of
> the whole physical CPU package the vCPU thread runs on and the
> thread's utime and stime values.
> 
> All other non-vCPU threads are also taken into account. Their energy
> consumption is evenly distributed among all vCPUs threads running on
> the same physical CPU package.
> 
> To overcome the problem that reading the RAPL MSR requires privileged
> access, a socket communication between QEMU and the qemu-vmsr-helper is
> mandatory. You can specify the socket path in the parameter.
> 
> This feature is activated with
> -accel kvm,rapl=true,rapl-helper-socket=/path/sock.sock
> 
> Current limitations:
> - Works only on Intel host CPUs because AMD CPUs use different MSR
>   addresses.
> 
> - Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at
>   the moment.
> 
> Signed-off-by: Anthony Harivel <aharivel@redhat.com>
> ---
>  accel/kvm/kvm-all.c           |  27 +++
>  docs/specs/index.rst          |   1 +
>  docs/specs/rapl-msr.rst       | 155 ++++++++++++
>  include/sysemu/kvm_int.h      |  32 +++
>  target/i386/cpu.h             |   8 +
>  target/i386/kvm/kvm.c         | 431 +++++++++++++++++++++++++++++++++-
>  target/i386/kvm/meson.build   |   1 +
>  target/i386/kvm/vmsr_energy.c | 344 +++++++++++++++++++++++++++
>  target/i386/kvm/vmsr_energy.h |  99 ++++++++
>  9 files changed, 1097 insertions(+), 1 deletion(-)
>  create mode 100644 docs/specs/rapl-msr.rst
>  create mode 100644 target/i386/kvm/vmsr_energy.c
>  create mode 100644 target/i386/kvm/vmsr_energy.h
> 
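
The attribution described in the commit message can be illustrated with a
small standalone sketch. It restates the arithmetic of vmsr_delta_ticks()
and vmsr_get_ratio() from the patch; the names ticks_delta() and
energy_share() are hypothetical, maxticks is assumed to be the total
number of ticks available on the package during the sampling interval,
and the zero check is an extra guard that the patch does not have.

```c
#include <stdint.h>

/* Ticks a thread consumed between two samples (user + system time). */
static unsigned long long ticks_delta(unsigned long long utime0,
                                      unsigned long long stime0,
                                      unsigned long long utime1,
                                      unsigned long long stime1)
{
    return (utime1 + stime1) - (utime0 + stime0);
}

/*
 * Share of the package energy delta attributed to one thread.
 * The patch computes (e_delta / 100.0) * ((100.0 / maxticks) * delta_ticks),
 * which is algebraically e_delta * delta_ticks / maxticks.
 */
static double energy_share(uint64_t e_delta,
                           unsigned long long delta_ticks,
                           unsigned int maxticks)
{
    if (maxticks == 0) {
        return 0.0; /* guard added for this sketch only */
    }
    return (double)e_delta * (double)delta_ticks / (double)maxticks;
}
```

For example, a thread that used 50 of 100 available ticks while the
package counter advanced by 1000 units is credited with 500 units.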


> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index c0be9f5eedb8..f455e6b987b4 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -3745,6 +3745,21 @@ static void kvm_set_device(Object *obj,
>      s->device = g_strdup(value);
>  }
>  
> +static void kvm_set_kvm_rapl(Object *obj, bool value, Error **errp)
> +{
> +    KVMState *s = KVM_STATE(obj);
> +    s->msr_energy.enable = value;
> +}
> +
> +static void kvm_set_kvm_rapl_socket_path(Object *obj,
> +                                         const char *str,
> +                                         Error **errp)
> +{
> +    KVMState *s = KVM_STATE(obj);
> +    g_free(s->msr_energy.socket_path);
> +    s->msr_energy.socket_path = g_strdup(str);
> +}
> +
>  static void kvm_accel_instance_init(Object *obj)
>  {
>      KVMState *s = KVM_STATE(obj);
> @@ -3764,6 +3779,7 @@ static void kvm_accel_instance_init(Object *obj)
>      s->xen_gnttab_max_frames = 64;
>      s->xen_evtchn_max_pirq = 256;
>      s->device = NULL;
> +    s->msr_energy.enable = false;
>  }
>  
>  /**
> @@ -3808,6 +3824,17 @@ static void kvm_accel_class_init(ObjectClass *oc, void *data)
>      object_class_property_set_description(oc, "device",
>          "Path to the device node to use (default: /dev/kvm)");
>  
> +    object_class_property_add_bool(oc, "rapl",
> +                                   NULL,
> +                                   kvm_set_kvm_rapl);
> +    object_class_property_set_description(oc, "rapl",
> +        "Allow energy related MSRs for RAPL interface in Guest");
> +
> +    object_class_property_add_str(oc, "rapl-helper-socket", NULL,
> +                                  kvm_set_kvm_rapl_socket_path);
> +    object_class_property_set_description(oc, "rapl-helper-socket",
> +        "Socket Path for comminucating with the Virtual MSR helper daemon");
> +
>      kvm_arch_accel_class_init(oc);
>  }

RAPL seems to be an x86-specific feature, so why is it in generic KVM code
instead of target/i386/kvm/kvm.c: kvm_arch_accel_class_init()?

>  
> diff --git a/docs/specs/index.rst b/docs/specs/index.rst
> index 1484e3e76077..e738ea7d102f 100644
> --- a/docs/specs/index.rst
> +++ b/docs/specs/index.rst
> @@ -33,3 +33,4 @@ guest hardware that is specific to QEMU.
>     virt-ctlr
>     vmcoreinfo
>     vmgenid
> +   rapl-msr
> diff --git a/docs/specs/rapl-msr.rst b/docs/specs/rapl-msr.rst
> new file mode 100644
> index 000000000000..1202ee89bee0
> --- /dev/null
> +++ b/docs/specs/rapl-msr.rst
> @@ -0,0 +1,155 @@
> +================
> +RAPL MSR support
> +================
> +
> +The RAPL interface (Running Average Power Limit) is advertising the accumulated
> +energy consumption of various power domains (e.g. CPU packages, DRAM, etc.).
> +
> +The consumption is reported via MSRs (model specific registers) like
> +MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are 64 bits
> +registers that represent the accumulated energy consumption in micro Joules.
> +
> +Thanks to the MSR Filtering patch [#a]_ not all MSRs are handled by KVM. Some
> +of them can now be handled by the userspace (QEMU). It uses a mechanism called
> +"MSR filtering" where a list of MSRs is given at init time of a VM to KVM so
> +that a callback is put in place. The design of this patch uses only this
> +mechanism for handling the MSRs between guest/host.
> +
> +At the moment the following MSRs are involved:
> +
> +.. code:: C
> +
> +    #define MSR_RAPL_POWER_UNIT             0x00000606
> +    #define MSR_PKG_POWER_LIMIT             0x00000610
> +    #define MSR_PKG_ENERGY_STATUS           0x00000611
> +    #define MSR_PKG_POWER_INFO              0x00000614
> +
> +The ``*_POWER_UNIT``, ``*_POWER_LIMIT``, ``*_POWER INFO`` are part of the RAPL
> +spec and specify the power limit of the package, provide range of parameter(min
> +power, max power,..) and also the information of the multiplier for the energy
> +counter to calculate the power. Those MSRs are populated once at the beginning
> +by reading the host CPU MSRs and are given back to the guest 1:1 when
> +requested.
> +
> +The MSR_PKG_ENERGY_STATUS is a counter; it represents the total amount of
> +energy consumed since the last time the register was cleared. If you multiply
> +it with the UNIT provided above you'll get the power in micro-joules. This
> +counter is always increasing and it increases more or less faster depending on
> +the consumption of the package. This counter is supposed to overflow at some
> +point.
> +
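
As a side note on the unit handling described above, here is a minimal
sketch of turning two raw counter samples into microjoules. It assumes the
layout from the Intel SDM (Energy Status Units in bits 12:8 of
MSR_RAPL_POWER_UNIT, a 32-bit counter in bits 31:0 of
MSR_PKG_ENERGY_STATUS). The function names are hypothetical and not part of
this patch.

```c
#include <stdint.h>

/* Energy Status Units (ESU): bits 12:8 of MSR_RAPL_POWER_UNIT. */
unsigned int rapl_energy_unit_shift(uint64_t power_unit_msr)
{
    return (power_unit_msr >> 8) & 0x1f;
}

/*
 * Delta between two raw samples of the 32-bit energy counter.
 * Unsigned 32-bit subtraction is modulo 2^32, so a single counter
 * wrap between the two samples still yields the right delta.
 */
uint32_t rapl_raw_delta(uint64_t start, uint64_t end)
{
    return (uint32_t)end - (uint32_t)start;
}

/* One raw unit is 1/2^ESU joule, so microjoules = raw * 10^6 / 2^ESU. */
uint64_t rapl_raw_to_uj(uint32_t raw, unsigned int esu)
{
    return ((uint64_t)raw * 1000000ULL) >> esu;
}
```

Whether a lower end value should be treated as a wrap (as sketched) or as a
reason to drop the sample is a policy choice; the sketch only shows the
arithmetic.
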
> +Each core belonging to the same Package reading the MSR_PKG_ENERGY_STATUS (i.e
> +"rdmsr 0x611") will retrieve the same value. The value represents the energy
> +for the whole package. Whatever Core reading it will get the same value and a
> +core that belongs to PKG-0 will not be able to get the value of PKG-1 and
> +vice-versa.
> +
> +High level implementation
> +-------------------------
> +
> +In order to update the value of the virtual MSR, a QEMU thread is created.
> +The thread is basically just an infinity loop that does:
> +
> +1. Snapshot of the time metrics of all QEMU threads (Time spent scheduled in
> +   Userspace and System)
> +
> +2. Snapshot of the actual MSR_PKG_ENERGY_STATUS counter of all packages where
> +   the QEMU threads are running on.
> +
> +3. Sleep for 1 second - During this pause the vcpu and other non-vcpu threads
> +   will do what they have to do and so the energy counter will increase.
> +
> +4. Repeat 2. and 3. and calculate the delta of every metrics representing the
> +   time spent scheduled for each QEMU thread *and* the energy spent by the
> +   packages during the pause.
> +
> +5. Filter the vcpu threads and the non-vcpu threads.
> +
> +6. Retrieve the topology of the Virtual Machine. This helps identify which
> +   vCPU is running on which virtual package.
> +
> +7. The total energy spent by the non-vcpu threads is divided by the number
> +   of vcpu threads so that each vcpu thread will get an equal part of the
> +   energy spent by the QEMU workers.
> +
> +8. Calculate the ratio of energy spent per vcpu threads.
> +
> +9. Calculate the energy for each virtual package.
> +
> +10. The virtual MSRs are updated for each virtual package. Each vCPU that
> +    belongs to the same package will return the same value when accessing the
> +    the MSR.
> +
> +11. Loop back to 1.
> +
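
To check my understanding of steps 7 to 9, here is a minimal sketch reduced
to a single host package. The struct and function names are hypothetical
(they are not the ones used in the patch), and the per-thread energy is
assumed to have been computed already.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-thread sample; field names are illustrative only. */
struct thread_sample {
    int is_vcpu;            /* non-zero for a vCPU thread */
    unsigned int vpkg_id;   /* virtual package, meaningful for vCPUs only */
    uint64_t energy;        /* energy attributed to the thread this period */
};

/*
 * Split the energy of the non-vCPU threads evenly across the vCPU
 * threads, then add each vCPU's total to its virtual package counter.
 */
void accumulate_vpkg_energy(const struct thread_sample *t, size_t n,
                            uint64_t *vpkg_energy, size_t nb_vpkgs)
{
    uint64_t worker_energy = 0;
    size_t nb_vcpus = 0;

    for (size_t i = 0; i < n; i++) {
        if (t[i].is_vcpu) {
            nb_vcpus++;
        } else {
            worker_energy += t[i].energy;
        }
    }
    if (nb_vcpus == 0) {
        return;     /* nothing to attribute the energy to */
    }

    uint64_t worker_share = worker_energy / nb_vcpus;

    for (size_t i = 0; i < n; i++) {
        if (t[i].is_vcpu && t[i].vpkg_id < nb_vpkgs) {
            vpkg_energy[t[i].vpkg_id] += t[i].energy + worker_share;
        }
    }
}
```

With two vCPUs (100 and 50 units) on virtual packages 0 and 1 and two
workers (30 and 10 units), each vCPU gets 20 extra units, so the packages
advance by 120 and 70.
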
> +Ratio calculation
> +-----------------
> +
> +In Linux, a process has an execution time associated with it. The scheduler is
> +dividing the time in clock ticks. The number of clock ticks per second can be
> +found by the sysconf system call. A typical value of clock ticks per second is
> +100. So a core can run a process at the maximum of 100 ticks per second. If a
> +package has 4 cores, 400 ticks maximum can be scheduled on all the cores
> +of the package for a period of 1 second.
> +
> +The /proc/[pid]/stat [#b]_ is a sysfs file that can give the executed time of a
> +process with the [pid] as the process ID. It gives the amount of ticks the
> +process has been scheduled in userspace (utime) and kernel space (stime).
> +
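
Nit: /proc is procfs, not sysfs. For reference, a minimal sketch of pulling
utime and stime out of one stat line, assuming the field order documented
in proc(5) (utime is field 14, stime is field 15). The function name is
hypothetical; I have not checked it against vmsr_read_thread_stat().

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse utime (field 14) and stime (field 15) from one line of a
 * /proc/<pid>/stat style file.
 * Returns 0 on success, -1 on malformed input.
 */
int parse_stat_ticks(const char *line,
                     unsigned long long *utime,
                     unsigned long long *stime)
{
    /* comm (field 2) may contain spaces or ')': restart after the last ')' */
    const char *p = strrchr(line, ')');

    if (p == NULL) {
        return -1;
    }
    /* Skip state (field 3) and fields 4..13, then read fields 14 and 15 */
    if (sscanf(p + 1,
               " %*c %*d %*d %*d %*d %*d %*u %*lu %*lu %*lu %*lu %llu %llu",
               utime, stime) != 2) {
        return -1;
    }
    return 0;
}
```

Searching for the last ')' avoids misparsing thread names that contain
spaces or parentheses.
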
> +By reading those metrics for a thread, one can calculate the ratio of time the
> +package has spent executing the thread.
> +
> +Example:
> +
> +A 4 cores package can schedule a maximum of 400 ticks per second with 100 ticks
> +per second per core. If a thread was scheduled for 100 ticks between a second
> +on this package, that means my thread has been scheduled for 1/4 of the whole
> +package. With that, the calculation of the energy spent by the thread on this
> +package during this whole second is 1/4 of the total energy spent by the
> +package.
> +
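
The example above boils down to the following arithmetic. This is a
minimal sketch with a hypothetical function name; the clamp and the
zero check are my own assumptions, not something the patch documents.

```c
#include <stdint.h>

/*
 * Share of a package's energy delta attributed to one thread:
 *     e_delta * delta_ticks / maxticks
 * where maxticks = ticks per second * cores in the package * period (s).
 */
uint64_t thread_energy_share(uint64_t e_delta,
                             unsigned int delta_ticks,
                             unsigned int maxticks)
{
    if (maxticks == 0) {
        return 0;                   /* avoid dividing by zero */
    }
    if (delta_ticks > maxticks) {
        delta_ticks = maxticks;     /* clamp sampling jitter */
    }
    return e_delta * delta_ticks / maxticks;
}
```

For the 4-core package in the text, thread_energy_share(1000, 100, 400)
gives 250, i.e. 1/4 of the package energy.
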
> +Usage
> +-----
> +
> +Currently this feature is only working on an Intel CPU that has the RAPL driver
> +mounted and available in the sysfs. if not, QEMU fails at start-up.
> +
> +This feature is activated with -accel
> +kvm,rapl=true,rapl-helper-socket=/path/sock.sock
> +
> +It is important that the socket path is the same as the one
> +:program:`qemu-vmsr-helper` is listening to.
> +
> +qemu-vmsr-helper
> +----------------
> +
> +The qemu-vmsr-helper is working very much like the qemu-pr-helper. Instead of
> +making persistent reservation, qemu-vmsr-helper is here to overcome the
> +CVE-2020-8694 which remove user access to the rapl msr attributes.
> +
> +A socket communication is established between QEMU processes that has the RAPL
> +MSR support activated and the qemu-vmsr-helper. A systemd service and socket
> +activation is provided in contrib/systemd/qemu-vmsr-helper.(service/socket).
> +
> +The systemd socket uses 600, like contrib/systemd/qemu-pr-helper.socket. The
> +socket can be passed via SCM_RIGHTS by libvirt, or its permissions can be
> +changed (e.g. 660 and root:kvm for a Debian system for example). Libvirt could
> +also start a separate helper if needed. All in all, the policy is left to the
> +user.
> +
> +See the qemu-pr-helper documentation or manpage for further details.
> +
> +Current Limitations
> +-------------------
> +
> +- Works only on Intel host CPUs because AMD CPUs are using different MSR
> +  addresses.
> +
> +- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at the
> +  moment.
> +
> +References
> +----------
> +
> +.. [#a] https://patchwork.kernel.org/project/kvm/patch/20200916202951.23760-7-graf@amazon.com/
> +.. [#b] https://man7.org/linux/man-pages/man5/proc.5.html
> diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
> index 3f3d13f81669..1d8fb1473bdf 100644
> --- a/include/sysemu/kvm_int.h
> +++ b/include/sysemu/kvm_int.h
> @@ -14,6 +14,9 @@
>  #include "qemu/accel.h"
>  #include "qemu/queue.h"
>  #include "sysemu/kvm.h"
> +#include "hw/boards.h"
> +#include "hw/i386/topology.h"
> +#include "io/channel-socket.h"


I'm skeptical about pulling x86-specific headers into a generic KVM header
(it's a miracle that it builds at all), and by extension 'hw/boards.h' as well.

Can it be refactored in a way that you won't need to pull in 'hw/boards.h'
and avoid using x86-specific code in generic KVM code?
(This also applies to the KVMHostTopoInfo added below.)
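
One possible shape for this, as a minimal single-file sketch: the generic
header only forward-declares the struct and stores a pointer, and the x86
code owns the definition. GenericKVMState and the arch_msr_energy_*()
helpers are hypothetical stand-ins, not existing QEMU names.

```c
#include <stdbool.h>
#include <stdlib.h>

/* --- what the generic header would contain --- */
struct KVMMsrEnergy;                    /* opaque: no x86 types needed */

struct GenericKVMState {                /* stand-in for KVMState */
    struct KVMMsrEnergy *msr_energy;    /* owned by the arch code */
};

/* --- what an x86-only file would contain --- */
struct KVMMsrEnergy {
    bool enable;
    unsigned int guest_vsockets;
    /* X86CPUTopoInfo, CPUArchIdList pointers, etc. would live here */
};

/* Allocate the arch-private state; returns false on allocation failure. */
bool arch_msr_energy_init(struct GenericKVMState *s, unsigned int vsockets)
{
    s->msr_energy = calloc(1, sizeof(*s->msr_energy));
    if (s->msr_energy == NULL) {
        return false;
    }
    s->msr_energy->enable = true;
    s->msr_energy->guest_vsockets = vsockets;
    return true;
}

/* Release the arch-private state and clear the generic pointer. */
void arch_msr_energy_cleanup(struct GenericKVMState *s)
{
    free(s->msr_energy);
    s->msr_energy = NULL;
}
```

With that split the generic header needs neither 'hw/i386/topology.h' nor
'hw/boards.h', at the cost of one extra allocation.
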


>  typedef struct KVMSlot
>  {
> @@ -50,6 +53,34 @@ typedef struct KVMMemoryListener {
>  
>  #define KVM_MSI_HASHTAB_SIZE    256
>  
> +typedef struct KVMHostTopoInfo {
> +    /* Number of package on the Host */
> +    unsigned int maxpkgs;
> +    /* Number of cpus on the Host */
> +    unsigned int maxcpus;
> +    /* Number of cpus on each different package */
> +    unsigned int *pkg_cpu_count;
> +    /* Each package can have different maxticks */
> +    unsigned int *maxticks;
> +} KVMHostTopoInfo;
> +
> +struct KVMMsrEnergy {
> +    pid_t pid;
> +    bool enable;
> +    char *socket_path;
> +    QIOChannelSocket *sioc;
> +    QemuThread msr_thr;
> +    unsigned int guest_vcpus;
> +    unsigned int guest_vsockets;
> +    X86CPUTopoInfo guest_topo_info;
> +    KVMHostTopoInfo host_topo;
> +    const CPUArchIdList *guest_cpu_list;
> +    uint64_t *msr_value;
> +    uint64_t msr_unit;
> +    uint64_t msr_limit;
> +    uint64_t msr_info;
> +};
> +
>  enum KVMDirtyRingReaperState {
>      KVM_DIRTY_RING_REAPER_NONE = 0,
>      /* The reaper is sleeping */
> @@ -117,6 +148,7 @@ struct KVMState
>      bool kvm_dirty_ring_with_bitmap;
>      uint64_t kvm_eager_split_size;  /* Eager Page Splitting chunk size */
>      struct KVMDirtyRingReaper reaper;
> +    struct KVMMsrEnergy msr_energy;
>      NotifyVmexitOption notify_vmexit;
>      uint32_t notify_window;
>      uint32_t xen_version;
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index ccccb62fc353..c3891c1a6b4e 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -397,6 +397,10 @@ typedef enum X86Seg {
>  #define MSR_IA32_TSX_CTRL		0x122
>  #define MSR_IA32_TSCDEADLINE            0x6e0
>  #define MSR_IA32_PKRS                   0x6e1
> +#define MSR_RAPL_POWER_UNIT             0x00000606
> +#define MSR_PKG_POWER_LIMIT             0x00000610
> +#define MSR_PKG_ENERGY_STATUS           0x00000611
> +#define MSR_PKG_POWER_INFO              0x00000614
>  #define MSR_ARCH_LBR_CTL                0x000014ce
>  #define MSR_ARCH_LBR_DEPTH              0x000014cf
>  #define MSR_ARCH_LBR_FROM_0             0x00001500
> @@ -1790,6 +1794,10 @@ typedef struct CPUArchState {
>  
>      uintptr_t retaddr;
>  
> +    /* RAPL MSR */
> +    uint64_t msr_rapl_power_unit;
> +    uint64_t msr_pkg_energy_status;
> +
>      /* Fields up to this point are cleared by a CPU reset */
>      struct {} end_reset_fields;
>  
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index c5943605ee3a..8767c8e06028 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c

I'd also suggest moving most of the RAPL-related code added here to vmsr_energy.c,
and leaving only the MSR access plumbing here.
(This file is already huge, and adding 400+ LoC doesn't help its readability at all.)

> @@ -16,9 +16,12 @@
>  #include "qapi/qapi-events-run-state.h"
>  #include "qapi/error.h"
>  #include "qapi/visitor.h"
> +#include <math.h>
>  #include <sys/ioctl.h>
>  #include <sys/utsname.h>
>  #include <sys/syscall.h>
> +#include <sys/resource.h>
> +#include <sys/time.h>
>  
>  #include <linux/kvm.h>
>  #include "standard-headers/asm-x86/kvm_para.h"
> @@ -26,6 +29,7 @@
>  
>  #include "cpu.h"
>  #include "host-cpu.h"
> +#include "vmsr_energy.h"
>  #include "sysemu/sysemu.h"
>  #include "sysemu/hw_accel.h"
>  #include "sysemu/kvm_int.h"
> @@ -2519,7 +2523,8 @@ static int kvm_get_supported_msrs(KVMState *s)
>      return ret;
>  }
>  
> -static bool kvm_rdmsr_core_thread_count(X86CPU *cpu, uint32_t msr,
> +static bool kvm_rdmsr_core_thread_count(X86CPU *cpu,
> +                                        uint32_t msr,
>                                          uint64_t *val)
>  {
>      CPUState *cs = CPU(cpu);
> @@ -2530,6 +2535,53 @@ static bool kvm_rdmsr_core_thread_count(X86CPU *cpu, uint32_t msr,
>      return true;
>  }
>  
> +static bool kvm_rdmsr_rapl_power_unit(X86CPU *cpu,
> +                                      uint32_t msr,
> +                                      uint64_t *val)
> +{
> +
> +    CPUState *cs = CPU(cpu);
> +
> +    *val = cs->kvm_state->msr_energy.msr_unit;
> +
> +    return true;
> +}
> +
> +static bool kvm_rdmsr_pkg_power_limit(X86CPU *cpu,
> +                                      uint32_t msr,
> +                                      uint64_t *val)
> +{
> +
> +    CPUState *cs = CPU(cpu);
> +
> +    *val = cs->kvm_state->msr_energy.msr_limit;
> +
> +    return true;
> +}
> +
> +static bool kvm_rdmsr_pkg_power_info(X86CPU *cpu,
> +                                     uint32_t msr,
> +                                     uint64_t *val)
> +{
> +
> +    CPUState *cs = CPU(cpu);
> +
> +    *val = cs->kvm_state->msr_energy.msr_info;
> +
> +    return true;
> +}
> +
> +static bool kvm_rdmsr_pkg_energy_status(X86CPU *cpu,
> +                                        uint32_t msr,
> +                                        uint64_t *val)
> +{
> +
> +    CPUState *cs = CPU(cpu);
> +    *val = cs->kvm_state->msr_energy.msr_value[cs->cpu_index];
> +
> +    return true;
> +}
> +
>  static Notifier smram_machine_done;
>  static KVMMemoryListener smram_listener;
>  static AddressSpace smram_address_space;
> @@ -2564,6 +2616,340 @@ static void register_smram_listener(Notifier *n, void *unused)
>                                   &smram_address_space, 1, "kvm-smram");
>  }
>  
> +static void *kvm_msr_energy_thread(void *data)
> +{
> +    KVMState *s = data;
> +    struct KVMMsrEnergy *vmsr = &s->msr_energy;
> +
> +    g_autofree vmsr_package_energy_stat *pkg_stat = NULL;
> +    g_autofree vmsr_thread_stat *thd_stat = NULL;
> +    g_autofree CPUState *cpu = NULL;
> +    g_autofree unsigned int *vpkgs_energy_stat = NULL;
> +    unsigned int num_threads = 0;
> +
> +    X86CPUTopoIDs topo_ids;
> +
> +    rcu_register_thread();
> +
> +    /* Allocate memory for each package energy status */
> +    pkg_stat = g_new0(vmsr_package_energy_stat, vmsr->host_topo.maxpkgs);
> +
> +    /* Allocate memory for thread stats */
> +    thd_stat = g_new0(vmsr_thread_stat, 1);
> +
> +    /* Allocate memory for holding virtual package energy counter */
> +    vpkgs_energy_stat = g_new0(unsigned int, vmsr->guest_vsockets);
> +
> +    /* Populate the max tick of each packages */
> +    for (int i = 0; i < vmsr->host_topo.maxpkgs; i++) {
> +        /*
> +         * Max numbers of ticks per package
> +         * Time in second * Number of ticks/second * Number of cores/package
> +         * ex: 100 ticks/second/CPU, 12 CPUs per Package gives 1200 ticks max
> +         */
> +        vmsr->host_topo.maxticks[i] = (MSR_ENERGY_THREAD_SLEEP_US / 1000000)
> +                        * sysconf(_SC_CLK_TCK)
> +                        * vmsr->host_topo.pkg_cpu_count[i];
> +    }
> +
> +    while (true) {
> +        /* Get all qemu threads id */
> +        g_autofree pid_t *thread_ids =
> +            thread_ids = vmsr_get_thread_ids(vmsr->pid, &num_threads);
> +
> +        if (thread_ids == NULL) {
> +            goto clean;
> +        }
> +
> +        thd_stat = g_renew(vmsr_thread_stat, thd_stat, num_threads);
> +        /* Unlike g_new0, g_renew0 function doesn't exist yet... */
> +        memset(thd_stat, 0, num_threads * sizeof(vmsr_thread_stat));
> +
> +        /* Populate all the thread stats */
> +        for (int i = 0; i < num_threads; i++) {
> +            thd_stat[i].utime = g_new0(unsigned long long, 2);
> +            thd_stat[i].stime = g_new0(unsigned long long, 2);
> +            thd_stat[i].thread_id = thread_ids[i];
> +            vmsr_read_thread_stat(vmsr->pid,
> +                                  thd_stat[i].thread_id,
> +                                  thd_stat[i].utime,
> +                                  thd_stat[i].stime,
> +                                  &thd_stat[i].cpu_id);
> +            thd_stat[i].pkg_id =
> +                vmsr_get_physical_package_id(thd_stat[i].cpu_id);
> +        }
> +
> +        /* Retrieve all packages power plane energy counter */
> +        for (int i = 0; i < vmsr->host_topo.maxpkgs; i++) {
> +            for (int j = 0; j < num_threads; j++) {
> +                /*
> +                 * Use the first thread we found that ran on the CPU
> +                 * of the package to read the packages energy counter
> +                 */
> +                if (thd_stat[j].pkg_id == i) {
> +                    pkg_stat[i].e_start =
> +                    vmsr_read_msr(MSR_PKG_ENERGY_STATUS,
> +                                  thd_stat[j].cpu_id,
> +                                  thd_stat[j].thread_id,
> +                                  s->msr_energy.sioc);
> +                    break;
> +                }
> +            }
> +        }
> +
> +        /* Sleep a short period while the other threads are working */
> +        usleep(MSR_ENERGY_THREAD_SLEEP_US);
> +
> +        /*
> +         * Retrieve all packages power plane energy counter
> +         * Calculate the delta of all packages
> +         */
> +        for (int i = 0; i < vmsr->host_topo.maxpkgs; i++) {
> +            for (int j = 0; j < num_threads; j++) {
> +                /*
> +                 * Use the first thread we found that ran on the CPU
> +                 * of the package to read the packages energy counter
> +                 */
> +                if (thd_stat[j].pkg_id == i) {
> +                    pkg_stat[i].e_end =
> +                    vmsr_read_msr(MSR_PKG_ENERGY_STATUS,
> +                                  thd_stat[j].cpu_id,
> +                                  thd_stat[j].thread_id,
> +                                  s->msr_energy.sioc);
> +                    /*
> +                     * Prevent the case we have migrate the VM
> +                     * during the sleep period or any other cases
> +                     * were energy counter might be lower after
> +                     * the sleep period.
> +                     */
> +                    if (pkg_stat[i].e_end > pkg_stat[i].e_start) {
> +                        pkg_stat[i].e_delta =
> +                            pkg_stat[i].e_end - pkg_stat[i].e_start;
> +                    } else {
> +                        pkg_stat[i].e_delta = 0;
> +                    }
> +                    break;
> +                }
> +            }
> +        }
> +
> +        /* Delta of ticks spend by each thread between the sample */
> +        for (int i = 0; i < num_threads; i++) {
> +            vmsr_read_thread_stat(vmsr->pid,
> +                                  thd_stat[i].thread_id,
> +                                  thd_stat[i].utime,
> +                                  thd_stat[i].stime,
> +                                  &thd_stat[i].cpu_id);
> +
> +            if (vmsr->pid < 0) {
> +                /*
> +                 * We don't count the dead thread
> +                 * i.e threads that existed before the sleep
> +                 * and not anymore
> +                 */
> +                thd_stat[i].delta_ticks = 0;
> +            } else {
> +                vmsr_delta_ticks(thd_stat, i);
> +            }
> +        }
> +
> +        /*
> +         * Identify the vcpu threads
> +         * Calculate the number of vcpu per package
> +         */
> +        CPU_FOREACH(cpu) {
> +            for (int i = 0; i < num_threads; i++) {
> +                if (cpu->thread_id == thd_stat[i].thread_id) {
> +                    thd_stat[i].is_vcpu = true;
> +                    thd_stat[i].vcpu_id = cpu->cpu_index;
> +                    pkg_stat[thd_stat[i].pkg_id].nb_vcpu++;
> +                    thd_stat[i].acpi_id = kvm_arch_vcpu_id(cpu);
> +                    break;
> +                }
> +            }
> +        }
> +
> +        /* Retrieve the virtual package number of each vCPU */
> +        for (int i = 0; i < vmsr->guest_cpu_list->len; i++) {
> +            for (int j = 0; j < num_threads; j++) {
> +                if ((thd_stat[j].acpi_id ==
> +                        vmsr->guest_cpu_list->cpus[i].arch_id)
> +                    && (thd_stat[j].is_vcpu == true)) {
> +                    x86_topo_ids_from_apicid(thd_stat[j].acpi_id,
> +                        &vmsr->guest_topo_info, &topo_ids);
> +                    thd_stat[j].vpkg_id = topo_ids.pkg_id;
> +                }
> +            }
> +        }
> +
> +        /* Calculate the total energy of all non-vCPU thread */
> +        for (int i = 0; i < num_threads; i++) {
> +            if ((thd_stat[i].is_vcpu != true) &&
> +                (thd_stat[i].delta_ticks > 0)) {
> +                double temp;
> +                temp = vmsr_get_ratio(pkg_stat[thd_stat[i].pkg_id].e_delta,
> +                    thd_stat[i].delta_ticks,
> +                    vmsr->host_topo.maxticks[thd_stat[i].pkg_id]);
> +                pkg_stat[thd_stat[i].pkg_id].e_ratio
> +                    += (uint64_t)lround(temp);
> +            }
> +        }
> +
> +        /* Calculate the ratio per non-vCPU thread of each package */
> +        for (int i = 0; i < vmsr->host_topo.maxpkgs; i++) {
> +            if (pkg_stat[i].nb_vcpu > 0) {
> +                pkg_stat[i].e_ratio = pkg_stat[i].e_ratio / pkg_stat[i].nb_vcpu;
> +            }
> +        }
> +
> +        /*
> +         * Calculate the energy for each Package:
> +         * Energy Package = sum of each vCPU energy that belongs to the package
> +         */
> +        for (int i = 0; i < num_threads; i++) {
> +            if ((thd_stat[i].is_vcpu == true) && \
> +                    (thd_stat[i].delta_ticks > 0)) {
> +                double temp;
> +                temp = vmsr_get_ratio(pkg_stat[thd_stat[i].pkg_id].e_delta,
> +                    thd_stat[i].delta_ticks,
> +                    vmsr->host_topo.maxticks[thd_stat[i].pkg_id]);
> +                vpkgs_energy_stat[thd_stat[i].vpkg_id] +=
> +                    (uint64_t)lround(temp);
> +                vpkgs_energy_stat[thd_stat[i].vpkg_id] +=
> +                    pkg_stat[thd_stat[i].pkg_id].e_ratio;
> +            }
> +        }
> +
> +        /*
> +         * Finally populate the vmsr register of each vCPU with the total
> +         * package value to emulate the real hardware where each CPU return the
> +         * value of the package it belongs.
> +         */
> +        for (int i = 0; i < num_threads; i++) {
> +            if ((thd_stat[i].is_vcpu == true) && \
> +                    (thd_stat[i].delta_ticks > 0)) {
> +                vmsr->msr_value[thd_stat[i].vcpu_id] = \
> +                                        vpkgs_energy_stat[thd_stat[i].vpkg_id];
> +          }
> +        }
> +
> +        /* Freeing memory before zeroing the pointer */
> +        for (int i = 0; i < num_threads; i++) {
> +            g_free(thd_stat[i].utime);
> +            g_free(thd_stat[i].stime);
> +        }
> +   }
> +
> +clean:
> +    rcu_unregister_thread();
> +    return NULL;
> +}
> +
> +static int kvm_msr_energy_thread_init(KVMState *s, MachineState *ms)
> +{
> +    MachineClass *mc = MACHINE_GET_CLASS(ms);
> +    struct KVMMsrEnergy *r = &s->msr_energy;
> +    int ret = 0;
> +
> +    /*
> +     * Sanity check
> +     * 1. Host cpu must be Intel cpu
> +     * 2. RAPL must be enabled on the Host
> +     */
> +    if (is_host_cpu_intel()) {
> +        error_report("The RAPL feature can only be enabled on hosts\
> +                      with Intel CPU models");
> +        ret = 1;
> +        goto out;
> +    }
> +
> +    if (!is_rapl_enabled()) {
> +        ret = 1;
> +        goto out;
> +    }
> +
> +    /* Retrieve the virtual topology */
> +    vmsr_init_topo_info(&r->guest_topo_info, ms);
> +
> +    /* Retrieve the number of vcpu */
> +    r->guest_vcpus = ms->smp.cpus;
> +
> +    /* Retrieve the number of virtual sockets */
> +    r->guest_vsockets = ms->smp.sockets;
> +
> +    /* Allocate register memory (MSR_PKG_STATUS) for each vcpu */
> +    r->msr_value = g_new0(uint64_t, r->guest_vcpus);
> +
> +    /* Retrieve the CPUArchIDlist */
> +    r->guest_cpu_list = mc->possible_cpu_arch_ids(ms);
> +
> +    /* Max number of cpus on the Host */
> +    r->host_topo.maxcpus = vmsr_get_maxcpus();
> +    if (r->host_topo.maxcpus == 0) {
> +        error_report("host max cpus = 0");
> +        ret = 1;
> +        goto out;
> +    }
> +
> +    /* Max number of packages on the host */
> +    r->host_topo.maxpkgs = vmsr_get_max_physical_package(r->host_topo.maxcpus);
> +    if (r->host_topo.maxpkgs == 0) {
> +        error_report("host max pkgs = 0");
> +        ret = 1;
> +        goto out;
> +    }
> +
> +    /* Allocate memory for each package on the host */
> +    r->host_topo.pkg_cpu_count = g_new0(unsigned int, r->host_topo.maxpkgs);
> +    r->host_topo.maxticks = g_new0(unsigned int, r->host_topo.maxpkgs);
> +
> +    vmsr_count_cpus_per_package(r->host_topo.pkg_cpu_count,
> +                                r->host_topo.maxpkgs);
> +    for (int i = 0; i < r->host_topo.maxpkgs; i++) {
> +        if (r->host_topo.pkg_cpu_count[i] == 0) {
> +            error_report("cpu per packages = 0 on package_%d", i);
> +            ret = 1;
> +            goto out;
> +        }
> +    }
> +
> +    /* Get QEMU PID*/
> +    r->pid = getpid();
> +
> +    /* Compute the socket path if necessary */
> +    if (s->msr_energy.socket_path == NULL) {
> +        s->msr_energy.socket_path = vmsr_compute_default_paths();
> +    }
> +
> +    /* Open socket with vmsr helper */
> +    s->msr_energy.sioc = vmsr_open_socket(s->msr_energy.socket_path);
> +
> +    if (s->msr_energy.sioc == NULL) {
> +        error_report("vmsr socket opening failed");
> +        ret = 1;
> +        goto out;
> +    }
> +
> +    /* Those MSR values should not change */
> +    r->msr_unit  = vmsr_read_msr(MSR_RAPL_POWER_UNIT, 0, r->pid,
> +                                    s->msr_energy.sioc);
> +    r->msr_limit = vmsr_read_msr(MSR_PKG_POWER_LIMIT, 0, r->pid,
> +                                    s->msr_energy.sioc);
> +    r->msr_info  = vmsr_read_msr(MSR_PKG_POWER_INFO, 0, r->pid,
> +                                    s->msr_energy.sioc);
> +    if (r->msr_unit == 0 || r->msr_limit == 0 || r->msr_info == 0) {
> +        error_report("can't read any virtual msr");
> +        ret = 1;
> +        goto out;
> +    }
> +
> +    qemu_thread_create(&r->msr_thr, "kvm-msr",
> +                       kvm_msr_energy_thread,
> +                       s, QEMU_THREAD_JOINABLE);
> +out:
> +    return ret;
> +}
> +
>  int kvm_arch_get_default_type(MachineState *ms)
>  {
>      return 0;
> @@ -2768,6 +3154,49 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>                           strerror(-ret));
>              exit(1);
>          }
> +
> +        if (s->msr_energy.enable == true) {
> +            r = kvm_filter_msr(s, MSR_RAPL_POWER_UNIT,
> +                               kvm_rdmsr_rapl_power_unit, NULL);
> +            if (!r) {
> +                error_report("Could not install MSR_RAPL_POWER_UNIT \
> +                                handler: %s",
> +                             strerror(-ret));
> +                exit(1);
> +            }
> +
> +            r = kvm_filter_msr(s, MSR_PKG_POWER_LIMIT,
> +                               kvm_rdmsr_pkg_power_limit, NULL);
> +            if (!r) {
> +                error_report("Could not install MSR_PKG_POWER_LIMIT \
> +                                handler: %s",
> +                             strerror(-ret));
> +                exit(1);
> +            }
> +
> +            r = kvm_filter_msr(s, MSR_PKG_POWER_INFO,
> +                               kvm_rdmsr_pkg_power_info, NULL);
> +            if (!r) {
> +                error_report("Could not install MSR_PKG_POWER_INFO \
> +                                handler: %s",
> +                             strerror(-ret));
> +                exit(1);
> +            }
> +            r = kvm_filter_msr(s, MSR_PKG_ENERGY_STATUS,
> +                               kvm_rdmsr_pkg_energy_status, NULL);
> +            if (!r) {
> +                error_report("Could not install MSR_PKG_ENERGY_STATUS \
> +                                handler: %s",
> +                             strerror(-ret));
> +                exit(1);
> +            }
> +            r = kvm_msr_energy_thread_init(s, ms);
> +            if (r) {
> +                error_report("kvm : error RAPL feature requirement not meet");
> +                exit(1);
> +            }
> +
> +        }
>      }
>  
>      return 0;
> diff --git a/target/i386/kvm/meson.build b/target/i386/kvm/meson.build
> index e7850981e62d..3996cafaf29f 100644
> --- a/target/i386/kvm/meson.build
> +++ b/target/i386/kvm/meson.build
> @@ -3,6 +3,7 @@ i386_kvm_ss = ss.source_set()
>  i386_kvm_ss.add(files(
>    'kvm.c',
>    'kvm-cpu.c',
> +  'vmsr_energy.c',
>  ))
>  
>  i386_kvm_ss.add(when: 'CONFIG_XEN_EMU', if_true: files('xen-emu.c'))
> diff --git a/target/i386/kvm/vmsr_energy.c b/target/i386/kvm/vmsr_energy.c
> new file mode 100644
> index 000000000000..acf0fc0a2fb3
> --- /dev/null
> +++ b/target/i386/kvm/vmsr_energy.c
> @@ -0,0 +1,344 @@
> +/*
> + * QEMU KVM support -- x86 virtual RAPL msr
> + *
> + * Copyright 2024 Red Hat, Inc. 2024
> + *
> + *  Author:
> + *      Anthony Harivel <aharivel@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/error-report.h"
> +#include "vmsr_energy.h"
> +#include "io/channel.h"
> +#include "io/channel-socket.h"
> +#include "hw/boards.h"
> +#include "cpu.h"
> +#include "host-cpu.h"
> +
> +char *vmsr_compute_default_paths(void)
> +{
> +    g_autofree char *state = qemu_get_local_state_dir();
> +
> +    return g_build_filename(state, "run", "qemu-vmsr-helper.sock", NULL);
> +}
> +
> +bool is_host_cpu_intel(void)
> +{
> +    int family, model, stepping;
> +    char vendor[CPUID_VENDOR_SZ + 1];
> +
> +    host_cpu_vendor_fms(vendor, &family, &model, &stepping);
> +
> +    return strcmp(vendor, CPUID_VENDOR_INTEL);
> +}
> +
> +int is_rapl_enabled(void)
> +{
> +    const char *path = "/sys/class/powercap/intel-rapl/enabled";
> +    FILE *file = fopen(path, "r");
> +    int value = 0;
> +
> +    if (file != NULL) {
> +        if (fscanf(file, "%d", &value) != 1) {
> +            error_report("INTEL RAPL not enabled");
> +        }
> +        fclose(file);
> +    } else {
> +        error_report("Error opening %s", path);
> +    }
> +
> +    return value;
> +}
> +
> +QIOChannelSocket *vmsr_open_socket(const char *path)
> +{
> +    g_autofree char *socket_path = NULL;
> +
> +    socket_path = g_strdup(path);
> +
> +    SocketAddress saddr = {
> +        .type = SOCKET_ADDRESS_TYPE_UNIX,
> +        .u.q_unix.path = socket_path
> +    };
> +
> +    QIOChannelSocket *sioc = qio_channel_socket_new();
> +    Error *local_err = NULL;
> +
> +    qio_channel_set_name(QIO_CHANNEL(sioc), "vmsr-helper");
> +    qio_channel_socket_connect_sync(sioc,
> +                                    &saddr,
> +                                    &local_err);
> +    if (local_err) {
> +        /* Close socket. */
> +        qio_channel_close(QIO_CHANNEL(sioc), NULL);
> +        object_unref(OBJECT(sioc));
> +        sioc = NULL;
> +        goto out;
> +    }
> +
> +    qio_channel_set_delay(QIO_CHANNEL(sioc), false);
> +out:
> +    return sioc;
> +}
> +
> +uint64_t vmsr_read_msr(uint32_t reg, uint32_t cpu_id, uint32_t tid,
> +                       QIOChannelSocket *sioc)
> +{
> +    uint64_t data = 0;
> +    int r = 0;
> +    Error *local_err = NULL;
> +    uint32_t buffer[3];
> +    /*
> +     * Send the required arguments:
> +     * 1. RAPL MSR register to read
> +     * 2. On which CPU ID
> +     * 3. From which vCPU (Thread ID)
> +     */
> +    buffer[0] = reg;
> +    buffer[1] = cpu_id;
> +    buffer[2] = tid;
> +
> +    r = qio_channel_write_all(QIO_CHANNEL(sioc),
> +                              (char *)buffer, sizeof(buffer),
> +                              &local_err);
> +    if (r < 0) {
> +        goto out_close;
> +    }
> +
> +    r = qio_channel_read_all(QIO_CHANNEL(sioc),
> +                             (char *)&data, sizeof(data),
> +                             &local_err);
> +    if (r < 0) {
> +        data = 0;
> +        goto out_close;
> +    }
> +
> +out_close:
> +    return data;
> +}
> +
> +/* Retrieve the max number of physical packages */
> +unsigned int vmsr_get_max_physical_package(unsigned int max_cpus)
> +{
> +    const char *dir = "/sys/devices/system/cpu/";
> +    const char *topo_path = "topology/physical_package_id";
> +    g_autofree int *uniquePackages = g_new0(int, max_cpus);
> +    unsigned int packageCount = 0;
> +    FILE *file = NULL;
> +
> +    for (int i = 0; i < max_cpus; i++) {
> +        g_autofree char *filePath = NULL;
> +        g_autofree char *cpuid = g_strdup_printf("cpu%d", i);
> +
> +        filePath = g_build_filename(dir, cpuid, topo_path, NULL);
> +
> +        file = fopen(filePath, "r");
> +
> +        if (file == NULL) {
> +            error_report("Error opening physical_package_id file");
> +            return 0;
> +        }
> +
> +        char packageId[10] = "";
> +        if (fgets(packageId, sizeof(packageId), file) == NULL) {
> +            packageCount = 0;
> +        }
> +
> +        fclose(file);
> +
> +        int currentPackageId = atoi(packageId);
> +
> +        bool isUnique = true;
> +        for (int j = 0; j < packageCount; j++) {
> +            if (uniquePackages[j] == currentPackageId) {
> +                isUnique = false;
> +                break;
> +            }
> +        }
> +
> +        if (isUnique) {
> +            uniquePackages[packageCount] = currentPackageId;
> +            packageCount++;
> +
> +            if (packageCount >= max_cpus) {
> +                break;
> +            }
> +        }
> +    }
> +
> +    return (packageCount == 0) ? 1 : packageCount;
> +}
> +
> +/* Retrieve the max number of physical cpus on the host */
> +unsigned int vmsr_get_maxcpus(void)
> +{
> +    GDir *dir;
> +    const gchar *entry_name;
> +    unsigned int cpu_count = 0;
> +    const char *path = "/sys/devices/system/cpu/";
> +
> +    dir = g_dir_open(path, 0, NULL);
> +    if (dir == NULL) {
> +        error_report("Unable to open cpu directory");
> +        return -1;
> +    }
> +
> +    while ((entry_name = g_dir_read_name(dir)) != NULL) {
> +        if (g_ascii_strncasecmp(entry_name, "cpu", 3) == 0 &&
> +            isdigit(entry_name[3])) {
> +            cpu_count++;
> +        }
> +    }
> +
> +    g_dir_close(dir);
> +
> +    return cpu_count;
> +}
> +
> +/* Count the number of physical cpus on each package */
> +unsigned int vmsr_count_cpus_per_package(unsigned int *package_count,
> +                                         unsigned int max_pkgs)
> +{
> +    /* Iterate over cpus and count cpus in each package */
> +    for (int cpu_id = 0; ; cpu_id++) {
> +        /* Declared per iteration so g_autofree releases them each pass */
> +        g_autofree char *file_contents = NULL;
> +        g_autofree char *path = NULL;
> +        gsize length;
> +
> +        path = g_strdup_printf("/sys/devices/system/cpu/cpu%d/"
> +            "topology/physical_package_id", cpu_id);
> +
> +        if (!g_file_get_contents(path, &file_contents, &length, NULL)) {
> +            break; /* No more cpus */
> +        }
> +
> +        /* Get the physical package ID for this CPU */
> +        int package_id = atoi(file_contents);
> +
> +        /* Check if the package ID is within the known number of packages */
> +        if (package_id >= 0 && package_id < max_pkgs) {
> +            /* If yes, count the cpu for this package*/
> +            package_count[package_id]++;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +/* Get the physical package id from a given cpu id */
> +int vmsr_get_physical_package_id(int cpu_id)
> +{
> +    g_autofree char *file_contents = NULL;
> +    g_autofree char *file_path = NULL;
> +    int package_id = -1;
> +    gsize length;
> +
> +    file_path = g_strdup_printf("/sys/devices/system/cpu/cpu%d"
> +        "/topology/physical_package_id", cpu_id);
> +
> +    if (!g_file_get_contents(file_path, &file_contents, &length, NULL)) {
> +        goto out;
> +    }
> +
> +    package_id = atoi(file_contents);
> +
> +out:
> +    return package_id;
> +}
> +
> +/* Read the scheduled time for a given thread of a given pid */
> +void vmsr_read_thread_stat(pid_t pid,
> +                      unsigned int thread_id,
> +                      unsigned long long *utime,
> +                      unsigned long long *stime,
> +                      unsigned int *cpu_id)
> +{
> +    g_autofree char *path = NULL;
> +    g_autofree char *path_name = NULL;
> +
> +    path_name = g_strdup_printf("/proc/%d/task/%u/stat", pid, thread_id);
> +
> +    path = g_build_filename(path_name, NULL);
> +
> +    FILE *file = fopen(path, "r");
> +    if (file == NULL) {
> +        pid = -1;
> +        return;
> +    }
> +
> +    if (fscanf(file, "%*d (%*[^)]) %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u"
> +        " %llu %llu %*d %*d %*d %*d %*d %*d %*u %*u %*d %*u %*u"
> +        " %*u %*u %*u %*u %*u %*u %*u %*u %*u %*d %*u %*u %u",
> +           utime, stime, cpu_id) != 3)
> +    {
> +        fclose(file);
> +        return;
> +    }
> +
> +    fclose(file);
> +    return;
> +}
> +
> +/* Read QEMU stat task folder to retrieve all QEMU threads ID */
> +pid_t *vmsr_get_thread_ids(pid_t pid, unsigned int *num_threads)
> +{
> +    g_autofree char *task_path = g_strdup_printf("%d/task", pid);
> +    g_autofree char *path = g_build_filename("/proc", task_path, NULL);
> +
> +    DIR *dir = opendir(path);
> +    if (dir == NULL) {
> +        error_report("Error opening /proc/qemu/task");
> +        return NULL;
> +    }
> +
> +    pid_t *thread_ids = NULL;
> +    unsigned int thread_count = 0;
> +
> +    struct dirent *ent = NULL;
> +    while ((ent = readdir(dir)) != NULL) {
> +        if (ent->d_name[0] == '.') {
> +            continue;
> +        }
> +        pid_t tid = atoi(ent->d_name);
> +        if (pid != tid) {
> +            thread_ids = g_renew(pid_t, thread_ids, (thread_count + 1));
> +            thread_ids[thread_count] = tid;
> +            thread_count++;
> +        }
> +    }
> +
> +    closedir(dir);
> +
> +    *num_threads = thread_count;
> +    return thread_ids;
> +}
> +
> +void vmsr_delta_ticks(vmsr_thread_stat *thd_stat, int i)
> +{
> +    thd_stat[i].delta_ticks = (thd_stat[i].utime[1] + thd_stat[i].stime[1])
> +                            - (thd_stat[i].utime[0] + thd_stat[i].stime[0]);
> +}
> +
> +double vmsr_get_ratio(uint64_t e_delta,
> +                      unsigned long long delta_ticks,
> +                      unsigned int maxticks)
> +{
> +    return (e_delta / 100.0) * ((100.0 / maxticks) * delta_ticks);
> +}
> +
> +void vmsr_init_topo_info(X86CPUTopoInfo *topo_info,
> +                           const MachineState *ms)
> +{
> +    topo_info->dies_per_pkg = ms->smp.dies;
> +    topo_info->cores_per_die = ms->smp.cores;
> +    topo_info->threads_per_core = ms->smp.threads;
> +}
> +
> diff --git a/target/i386/kvm/vmsr_energy.h b/target/i386/kvm/vmsr_energy.h
> new file mode 100644
> index 000000000000..16cc1f4814f6
> --- /dev/null
> +++ b/target/i386/kvm/vmsr_energy.h
> @@ -0,0 +1,99 @@
> +/*
> + * QEMU KVM support -- x86 virtual energy-related MSR.
> + *
> + * Copyright 2024 Red Hat, Inc. 2024
> + *
> + *  Author:
> + *      Anthony Harivel <aharivel@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef VMSR_ENERGY_H
> +#define VMSR_ENERGY_H
> +
> +#include <stdint.h>
> +#include "qemu/osdep.h"
> +#include "io/channel-socket.h"
> +#include "hw/i386/topology.h"
> +
> +/*
> + * Define the interval time in microseconds between 2 samples of
> + * energy related MSRs
> + */
> +#define MSR_ENERGY_THREAD_SLEEP_US 1000000.0
> +
> +/*
> + * Thread statistic
> + * @ thread_id: TID (thread ID)
> + * @ is_vcpu: true if TID is vCPU thread
> + * @ cpu_id: CPU number last executed on
> + * @ pkg_id: package number of the CPU
> + * @ vcpu_id: vCPU ID
> + * @ vpkg_id: virtual package number
> + * @ acpi_id: APIC id of the vCPU
> + * @ utime: amount of clock ticks the thread
> + *          has been scheduled in User mode
> + * @ stime: amount of clock ticks the thread
> + *          has been scheduled in System mode
> + * @ delta_ticks: delta of utime+stime between
> + *          the two samples (before/after sleep)
> + */
> +struct vmsr_thread_stat {
> +    unsigned int thread_id;
> +    bool is_vcpu;
> +    unsigned int cpu_id;
> +    unsigned int pkg_id;
> +    unsigned int vpkg_id;
> +    unsigned int vcpu_id;
> +    unsigned long acpi_id;
> +    unsigned long long *utime;
> +    unsigned long long *stime;
> +    unsigned long long delta_ticks;
> +};
> +
> +/*
> + * Package statistic
> + * @ e_start: package energy counter before the sleep
> + * @ e_end: package energy counter after the sleep
> + * @ e_delta: delta of package energy counter
> + * @ e_ratio: store the energy ratio of non-vCPU thread
> + * @ nb_vcpu: number of vCPU running on this package
> + */
> +struct vmsr_package_energy_stat {
> +    uint64_t e_start;
> +    uint64_t e_end;
> +    uint64_t e_delta;
> +    uint64_t e_ratio;
> +    unsigned int nb_vcpu;
> +};
> +
> +typedef struct vmsr_thread_stat vmsr_thread_stat;
> +typedef struct vmsr_package_energy_stat vmsr_package_energy_stat;
> +
> +char *vmsr_compute_default_paths(void);
> +void vmsr_read_thread_stat(pid_t pid,
> +                      unsigned int thread_id,
> +                      unsigned long long *utime,
> +                      unsigned long long *stime,
> +                      unsigned int *cpu_id);
> +
> +QIOChannelSocket *vmsr_open_socket(const char *path);
> +uint64_t vmsr_read_msr(uint32_t reg, uint32_t cpu_id,
> +                       uint32_t tid, QIOChannelSocket *sioc);
> +void vmsr_delta_ticks(vmsr_thread_stat *thd_stat, int i);
> +unsigned int vmsr_get_maxcpus(void);
> +unsigned int vmsr_get_max_physical_package(unsigned int max_cpus);
> +unsigned int vmsr_count_cpus_per_package(unsigned int *package_count,
> +                                         unsigned int max_pkgs);
> +int vmsr_get_physical_package_id(int cpu_id);
> +pid_t *vmsr_get_thread_ids(pid_t pid, unsigned int *num_threads);
> +double vmsr_get_ratio(uint64_t e_delta,
> +                      unsigned long long delta_ticks,
> +                      unsigned int maxticks);
> +void vmsr_init_topo_info(X86CPUTopoInfo *topo_info, const MachineState *ms);
> +bool is_host_cpu_intel(void);
> +int is_rapl_enabled(void);
> +#endif /* VMSR_ENERGY_H */




* Re: [PATCH v6 0/3] Add support for the RAPL MSRs series
  2024-10-16 11:52 ` Igor Mammedov
@ 2024-10-16 12:56   ` Anthony Harivel
  2024-10-18 12:25     ` Igor Mammedov
  0 siblings, 1 reply; 25+ messages in thread
From: Anthony Harivel @ 2024-10-16 12:56 UTC (permalink / raw)
  To: Igor Mammedov; +Cc: pbonzini, mtosatti, berrange, qemu-devel, vchundur, rjarry

Hi Igor,

Igor Mammedov, Oct 16, 2024 at 13:52:
> On Wed, 22 May 2024 17:34:49 +0200
> Anthony Harivel <aharivel@redhat.com> wrote:
>
>> Dear maintainers, 
>> 
>> First of all, thank you very much for your review of my patch 
>> [1].
>
> I've tried to play with this feature and have a few questions about it
>

Thanks for testing this new feature. 

>  1. trying to start with an inaccessible or nonexistent socket
>         -accel kvm,rapl=on,rapl-helper-socket=/tmp/socket 
>     I get:
>       qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/socks: vmsr socket opening failed
>       qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/socks: kvm : error RAPL feature requirement not met
>     * is it possible to report actual OS error that happened during open/connect,
>       instead of unhelpful 'socket opening failed'?
>
>       What I see in vmsr_open_socket() error is ignored
>       and btw it's error leak as well
>

It's a shame you missed the six iterations of this patch over the past year;
I would have changed that right away.
Anyway, I've taken note of that comment and will send a fix.
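
For illustration, here is a minimal sketch of what reporting the actual OS
error on a failed connect could look like. It uses plain POSIX calls rather
than the QIOChannel API used by the patch, and connect_unix_verbose() is a
hypothetical name, not a function from this series:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the patch: connect to a Unix socket
 * and, on failure, describe the OS error in errbuf.
 * Returns the connected fd, or -1 on error.
 */
static int connect_unix_verbose(const char *path, char *errbuf, size_t errlen)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    int fd;

    if (strlen(path) >= sizeof(addr.sun_path)) {
        snprintf(errbuf, errlen, "socket path is too long (max %zu bytes)",
                 sizeof(addr.sun_path) - 1);
        return -1;
    }
    strcpy(addr.sun_path, path);

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        snprintf(errbuf, errlen, "socket(): %s", strerror(errno));
        return -1;
    }

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        int saved_errno = errno; /* close() may clobber errno */

        close(fd);
        snprintf(errbuf, errlen, "connect to '%s' failed: %s",
                 path, strerror(saved_errno));
        return -1;
    }
    return fd;
}
```

With a missing socket file, the message would carry the strerror() text for
ENOENT instead of a generic "socket opening failed".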

>     * 2nd line shouldn't be there if the 1st error already present.
>
>  2.  getting periodic error on console where QEMU has been started
>       # ./qemu-vmsr-helper -k /tmp/sock
>      ./qemu-system-x86_64 -snapshot -m 4G -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock rhel90.img  -vnc :0 -cpu host
>      and let it run
>
>       it appears rdmsr works (well, it returns some values at least)
>       however there are recurring errors in qemu's stderr(or out)
>       
>       qemu-system-x86_64: Error opening /proc/2496093/task/2496109/stat
>       qemu-system-x86_64: Error opening /proc/2496093/task/2496095/stat
>
>       My guess it's some temporary threads, that come and go, but still
>       they shouldn't cause errors if it's normal operation.
>

There is a WIP patch that changes this into a tracepoint. Maybe you can
SSH into the VM in the meantime?

>       Also on daemon side, a few times I got this while guest was running:
>         qemu-vmsr-helper: Failed to open /proc at /proc/2496026/task/2496044
>         qemu-vmsr-helper: Requested TID not in peer PID: 2496026 2496044
>       though I can't reproduce it reliably

This can only happen when a vCPU thread ID has changed between the
rdmsr request sent through the socket and the helper reading the MSR.
I have no idea how a vCPU can change TID or shut down that fast.
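
For context, the helper messages quoted above suggest the check amounts to
looking for /proc/<pid>/task/<tid>. A minimal sketch of such a check,
assuming that layout (tid_belongs_to_pid() is a hypothetical name, not the
helper's actual function):

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical check: true if /proc/<pid>/task/<tid> currently exists,
 * i.e. tid is a live thread of pid at this instant.
 */
static bool tid_belongs_to_pid(pid_t pid, pid_t tid)
{
    char path[64];
    struct stat st;

    snprintf(path, sizeof(path), "/proc/%d/task/%d", (int)pid, (int)tid);
    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}
```

Any check of this kind is inherently racy: a thread that exits between the
request and the lookup makes it fail, which would match the sporadic error.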

>
>  3. when starting daemon not as root, it starts 'fine' but later on complains
>       qemu-vmsr-helper: Failed to open MSR file at /dev/cpu/0/msr
>     perhaps it would be better to fail at start daemon if it doesn't have
>     access to necessary files.
>

Right, I'm taking note of that as well.
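
A minimal sketch of such a start-up check, assuming the daemon only needs
read access to the MSR device node (check_msr_access() is a hypothetical
name, not something that exists in qemu-vmsr-helper today):

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical start-up check: try to open the MSR device node read-only.
 * Returns 0 if it can be opened, or a negative errno value otherwise.
 */
static int check_msr_access(const char *msr_path)
{
    int fd = open(msr_path, O_RDONLY);

    if (fd < 0) {
        return -errno;
    }
    close(fd);
    return 0;
}
```

The daemon could call this once for /dev/cpu/0/msr before listening on the
socket and exit with strerror(-ret) when it fails.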


>  4. in case #3, guest also fails to start with errors:
>       qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock: can't read any virtual msr
>       qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock: kvm : error RAPL feature requirement not met
>      again line #2 is not useful and probably not needed (maybe make it tracepoint)
>      and #1 is unhelpful - it would be better if it directed user to check qemu-vmsr-helper
>

I will try to see how to improve that part. 
Thanks for your valuable feedback.

>  5. does AMD have similar MSRs that we could use to make this feature complete?
>

Yes, but the addresses are completely different. This is on my to-do
list, though. First I need a lot more feedback like yours before moving on
to extending this feature.

>  6. What happens to power accounting if host constantly migrates
>     vcpus between sockets, are values we are getting still correct/meaningful?
>     Or do we need to pin vcpus to get 'accurate' values?
>

The ratio calculation takes into account which socket the vCPU has just
been scheduled on. But yes, the values are more 'accurate' when
the vCPU is pinned.
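
As a worked example of the ratio, here is a small standalone sketch that
mirrors the formula of vmsr_get_ratio() from the patch. The function name
thread_energy_share() and the numbers below are only illustrative:

```c
#include <stdint.h>

/*
 * Share of a package's energy delta attributed to one thread.
 * e_delta:     energy consumed by the package during the sample
 * delta_ticks: ticks (utime + stime) the thread was scheduled
 * maxticks:    ticks the whole package could schedule in that time
 */
static double thread_energy_share(uint64_t e_delta,
                                  unsigned long long delta_ticks,
                                  unsigned int maxticks)
{
    return (e_delta / 100.0) * ((100.0 / maxticks) * delta_ticks);
}
```

With a 4-core package at 100 ticks per second (maxticks = 400), a thread
scheduled for 100 ticks gets 1/4 of the package energy: an e_delta of 1000
gives a share of 250.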

>  7. do we have to have a dedicated thread for polling data from daemon?
>
>     Can we fetch data from vcpu thread that have accessed msr
>     (with some caching and rate limiting access to the daemon)?
>

This feature revolves around a dedicated thread. Please have a look at the
documentation if you haven't already:

https://www.qemu.org/docs/master/specs/rapl-msr.html#high-level-implementation

If we only fetched from the vCPU threads, we wouldn't have the consumption of
the non-vCPU threads, which are taken into account in the total.
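
To make the accounting concrete, here is a rough sketch of the even split
described in the documentation. The function names are hypothetical and the
real code in kvm.c is organised differently:

```c
#include <stdint.h>

/*
 * Hypothetical helper: split the energy attributed to non-vCPU threads
 * on one package evenly among the vCPU threads running on that package.
 * Returns 0 when no vCPU ran on the package.
 */
static uint64_t non_vcpu_share_per_vcpu(uint64_t non_vcpu_energy,
                                        unsigned int nb_vcpu)
{
    if (nb_vcpu == 0) {
        return 0;
    }
    return non_vcpu_energy / nb_vcpu;
}

/* Energy credited to one vCPU: its own share plus its part of the rest. */
static uint64_t vcpu_total_energy(uint64_t vcpu_energy,
                                  uint64_t non_vcpu_energy,
                                  unsigned int nb_vcpu)
{
    return vcpu_energy + non_vcpu_share_per_vcpu(non_vcpu_energy, nb_vcpu);
}
```

For instance, with 3 vCPUs on a package and 90 units spent by the worker
threads, a vCPU that consumed 100 units on its own is credited with 130.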



Thanks again for your feedback. 

Anthony


>> In this version (v6), I have attempted to address all the problems
>> raised by Daniel and Paolo during the last review.
>> 
>> However, two open questions remain unanswered that would require the
>> attention of the x86 maintainers:
>> 
>> 1) Should I move the rapl feature from -kvm to -cpu? [2]
>> 
>> 2) Should I already rename to "rapl_vmsr_*" in order to anticipate the
>>    future TMPI architecture? [end of 3]
>> 
>> Thank you again for your continued guidance. 
>> 
>> v5 -> v6
>> --------
>> - Better error consistency in qio_channel_get_peerpid()
>> - Memory leak g_strdup_printf/g_build_filename corrected
>> - Renaming several struct with "vmsr_*" for better namespace
>> - Renamed several struct with "guest_*" for better comprehension
>> - Optimization suggested by Daniel
>> - Crash problem solved [4]
>> 
>> v4 -> v5
>> --------
>> 
>> - correct qio_channel_get_peerpid: return pid = -1 in case of error
>> - Vmsr_helper: compile only for x86
>> - Vmsr_helper: use qio_channel_read/write_all
>> - Vmsr_helper: abandon user/group
>> - Vmsr_energy.c: correct all error_report
>> - Vmsr thread: compute default socket path only once
>> - Vmsr thread: open socket only once
>> - Pass relevant QEMU CI
>> 
>> v3 -> v4
>> --------
>> 
>> - Correct memory leaks with AddressSanitizer  
>> - Add sanity check for QEMU and qemu-vmsr-helper for checking if host is 
>>   INTEL and if RAPL is activated.
>> - Rename poor variables naming for easier comprehension
>> - Move code that checks Host before creating the VMSR thread
>> - Get rid of libnuma: create function that read sysfs for reading the 
>>   Host topology instead
>> 
>> v2 -> v3
>> --------
>> 
>> - Move all memory allocations from Clib to Glib
>> - Compile on *BSD (working on Linux only)
>> - No more limitation on the virtual package: each vCPU that belongs to 
>>   the same virtual package is giving the same results like expected on 
>>   a real CPU.
>>   This has been tested topology like:
>>      -smp 4,sockets=2
>>      -smp 16,sockets=4,cores=2,threads=2
>> 
>> v1 -> v2
>> --------
>> 
>> - To work around CVE-2020-8694, a socket communication is created
>>   to a privileged helper
>> - Add the privileged helper (qemu-vmsr-helper)
>> - Add SO_PEERCRED in qio channel socket
>> 
>> RFC -> v1
>> ---------
>> 
>> - Add vmsr_* in front of all vmsr specific function
>> - Change malloc()/calloc()... with all glib equivalent
>> - Pre-allocate all dynamic memories when possible
>> - Add a Documentation of implementation, limitation and usage
>> 
>> Best regards,
>> Anthony
>> 
>> [1]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg01570.html
>> [2]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg03947.html
>> [3]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg02350.html
>> [4]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg02481.html
>> 
>> Anthony Harivel (3):
>>   qio: add support for SO_PEERCRED for socket channel
>>   tools: build qemu-vmsr-helper
>>   Add support for RAPL MSRs in KVM/Qemu
>> 
>>  accel/kvm/kvm-all.c                      |  27 ++
>>  contrib/systemd/qemu-vmsr-helper.service |  15 +
>>  contrib/systemd/qemu-vmsr-helper.socket  |   9 +
>>  docs/specs/index.rst                     |   1 +
>>  docs/specs/rapl-msr.rst                  | 155 +++++++
>>  docs/tools/index.rst                     |   1 +
>>  docs/tools/qemu-vmsr-helper.rst          |  89 ++++
>>  include/io/channel.h                     |  21 +
>>  include/sysemu/kvm_int.h                 |  32 ++
>>  io/channel-socket.c                      |  28 ++
>>  io/channel.c                             |  13 +
>>  meson.build                              |   7 +
>>  target/i386/cpu.h                        |   8 +
>>  target/i386/kvm/kvm.c                    | 431 +++++++++++++++++-
>>  target/i386/kvm/meson.build              |   1 +
>>  target/i386/kvm/vmsr_energy.c            | 337 ++++++++++++++
>>  target/i386/kvm/vmsr_energy.h            |  99 +++++
>>  tools/i386/qemu-vmsr-helper.c            | 530 +++++++++++++++++++++++
>>  tools/i386/rapl-msr-index.h              |  28 ++
>>  19 files changed, 1831 insertions(+), 1 deletion(-)
>>  create mode 100644 contrib/systemd/qemu-vmsr-helper.service
>>  create mode 100644 contrib/systemd/qemu-vmsr-helper.socket
>>  create mode 100644 docs/specs/rapl-msr.rst
>>  create mode 100644 docs/tools/qemu-vmsr-helper.rst
>>  create mode 100644 target/i386/kvm/vmsr_energy.c
>>  create mode 100644 target/i386/kvm/vmsr_energy.h
>>  create mode 100644 tools/i386/qemu-vmsr-helper.c
>>  create mode 100644 tools/i386/rapl-msr-index.h
>> 




* Re: [PATCH v6 3/3] Add support for RAPL MSRs in KVM/Qemu
  2024-10-16 12:17   ` Igor Mammedov
@ 2024-10-16 13:04     ` Anthony Harivel
  0 siblings, 0 replies; 25+ messages in thread
From: Anthony Harivel @ 2024-10-16 13:04 UTC (permalink / raw)
  To: Igor Mammedov; +Cc: pbonzini, mtosatti, berrange, qemu-devel, vchundur, rjarry

Hi Igor,

I will let Paolo or Daniel answer those architecture questions; they
are more qualified than I am.

Thanks
Anthony


Igor Mammedov, Oct 16, 2024 at 14:17:
> On Wed, 22 May 2024 17:34:52 +0200
> Anthony Harivel <aharivel@redhat.com> wrote:
>
>> Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL
>> interface (Running Average Power Limit) for advertising the accumulated
>> energy consumption of various power domains (e.g. CPU packages, DRAM,
>> etc.).
>> 
>> The consumption is reported via MSRs (model specific registers) like
>> MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are
>> 64-bit registers that represent the accumulated energy consumption in
>> micro Joules. They are updated by microcode every ~1ms.
>> 
>> For now, KVM always returns 0 when the guest requests the value of
>> these MSRs. Use the KVM MSR filtering mechanism to allow QEMU to handle
>> these MSRs dynamically in userspace.
>> 
>> To limit the amount of system calls for every MSR call, create a new
>> thread in QEMU that updates the "virtual" MSR values asynchronously.
>> 
>> Each vCPU has its own vMSR to reflect the independence of vCPUs. The
>> thread updates the vMSR values with the ratio of energy consumed of
>> the whole physical CPU package the vCPU thread runs on and the
>> thread's utime and stime values.
>> 
>> All other non-vCPU threads are also taken into account. Their energy
>> consumption is evenly distributed among all vCPUs threads running on
>> the same physical CPU package.
>> 
>> To overcome the problem that reading the RAPL MSR requires privileged
>> access, a socket communication between QEMU and the qemu-vmsr-helper is
>> mandatory. You can specify the socket path in the parameter.
>> 
>> This feature is activated with
>> -accel kvm,rapl=true,rapl-helper-socket=/path/sock.sock
>> 
>> Current limitations:
>> - Works only on Intel host CPU because AMD CPUs are using different MSR
>>   addresses.
>> 
>> - Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at
>>   the moment.
>> 
>> Signed-off-by: Anthony Harivel <aharivel@redhat.com>
>> ---
>>  accel/kvm/kvm-all.c           |  27 +++
>>  docs/specs/index.rst          |   1 +
>>  docs/specs/rapl-msr.rst       | 155 ++++++++++++
>>  include/sysemu/kvm_int.h      |  32 +++
>>  target/i386/cpu.h             |   8 +
>>  target/i386/kvm/kvm.c         | 431 +++++++++++++++++++++++++++++++++-
>>  target/i386/kvm/meson.build   |   1 +
>>  target/i386/kvm/vmsr_energy.c | 344 +++++++++++++++++++++++++++
>>  target/i386/kvm/vmsr_energy.h |  99 ++++++++
>>  9 files changed, 1097 insertions(+), 1 deletion(-)
>>  create mode 100644 docs/specs/rapl-msr.rst
>>  create mode 100644 target/i386/kvm/vmsr_energy.c
>>  create mode 100644 target/i386/kvm/vmsr_energy.h
>> 
>
>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index c0be9f5eedb8..f455e6b987b4 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -3745,6 +3745,21 @@ static void kvm_set_device(Object *obj,
>>      s->device = g_strdup(value);
>>  }
>>  
>> +static void kvm_set_kvm_rapl(Object *obj, bool value, Error **errp)
>> +{
>> +    KVMState *s = KVM_STATE(obj);
>> +    s->msr_energy.enable = value;
>> +}
>> +
>> +static void kvm_set_kvm_rapl_socket_path(Object *obj,
>> +                                         const char *str,
>> +                                         Error **errp)
>> +{
>> +    KVMState *s = KVM_STATE(obj);
>> +    g_free(s->msr_energy.socket_path);
>> +    s->msr_energy.socket_path = g_strdup(str);
>> +}
>> +
>>  static void kvm_accel_instance_init(Object *obj)
>>  {
>>      KVMState *s = KVM_STATE(obj);
>> @@ -3764,6 +3779,7 @@ static void kvm_accel_instance_init(Object *obj)
>>      s->xen_gnttab_max_frames = 64;
>>      s->xen_evtchn_max_pirq = 256;
>>      s->device = NULL;
>> +    s->msr_energy.enable = false;
>>  }
>>  
>>  /**
>> @@ -3808,6 +3824,17 @@ static void kvm_accel_class_init(ObjectClass *oc, void *data)
>>      object_class_property_set_description(oc, "device",
>>          "Path to the device node to use (default: /dev/kvm)");
>>  
>> +    object_class_property_add_bool(oc, "rapl",
>> +                                   NULL,
>> +                                   kvm_set_kvm_rapl);
>> +    object_class_property_set_description(oc, "rapl",
>> +        "Allow energy related MSRs for RAPL interface in Guest");
>> +
>> +    object_class_property_add_str(oc, "rapl-helper-socket", NULL,
>> +                                  kvm_set_kvm_rapl_socket_path);
>> +    object_class_property_set_description(oc, "rapl-helper-socket",
>> +        "Socket Path for communicating with the Virtual MSR helper daemon");
>> +
>>      kvm_arch_accel_class_init(oc);
>>  }
>
> it seems, RAPL is x86 specific feature, so why it is in generic KVM code instead of
> target/i386/kvm/kvm.c: kvm_arch_accel_class_init()
>
>>  
>> diff --git a/docs/specs/index.rst b/docs/specs/index.rst
>> index 1484e3e76077..e738ea7d102f 100644
>> --- a/docs/specs/index.rst
>> +++ b/docs/specs/index.rst
>> @@ -33,3 +33,4 @@ guest hardware that is specific to QEMU.
>>     virt-ctlr
>>     vmcoreinfo
>>     vmgenid
>> +   rapl-msr
>> diff --git a/docs/specs/rapl-msr.rst b/docs/specs/rapl-msr.rst
>> new file mode 100644
>> index 000000000000..1202ee89bee0
>> --- /dev/null
>> +++ b/docs/specs/rapl-msr.rst
>> @@ -0,0 +1,155 @@
>> +================
>> +RAPL MSR support
>> +================
>> +
>> +The RAPL interface (Running Average Power Limit) advertises the accumulated
>> +energy consumption of various power domains (e.g. CPU packages, DRAM, etc.).
>> +
>> +The consumption is reported via MSRs (model specific registers) like
>> +MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are 64-bit
>> +registers that represent the accumulated energy consumption in micro Joules.
>> +
>> +Thanks to the MSR Filtering patch [#a]_ not all MSRs are handled by KVM. Some
>> +of them can now be handled by the userspace (QEMU). It uses a mechanism called
>> +"MSR filtering" where a list of MSRs is given at init time of a VM to KVM so
>> +that a callback is put in place. The design of this patch uses only this
>> +mechanism for handling the MSRs between guest/host.
>> +
>> +At the moment the following MSRs are involved:
>> +
>> +.. code:: C
>> +
>> +    #define MSR_RAPL_POWER_UNIT             0x00000606
>> +    #define MSR_PKG_POWER_LIMIT             0x00000610
>> +    #define MSR_PKG_ENERGY_STATUS           0x00000611
>> +    #define MSR_PKG_POWER_INFO              0x00000614
>> +
>> +The ``*_POWER_UNIT``, ``*_POWER_LIMIT``, ``*_POWER_INFO`` are part of the RAPL
>> +spec and specify the power limit of the package, provide the range of
>> +parameters (min power, max power, ...) and also the multiplier for the energy
>> +counter to calculate the power. Those MSRs are populated once at the beginning
>> +by reading the host CPU MSRs and are given back to the guest 1:1 when
>> +requested.
>> +
>> +The MSR_PKG_ENERGY_STATUS is a counter; it represents the total amount of
>> +energy consumed since the last time the register was cleared. If you multiply
>> +it by the UNIT provided above you'll get the energy in micro-joules. This
>> +counter is always increasing, and it increases faster or slower depending on
>> +the consumption of the package. This counter is expected to wrap around at
>> +some point.
>> +
>> +Each core belonging to the same Package reading the MSR_PKG_ENERGY_STATUS (i.e.
>> +"rdmsr 0x611") will retrieve the same value. The value represents the energy
>> +for the whole package. Whichever core reads it will get the same value, and a
>> +core that belongs to PKG-0 will not be able to get the value of PKG-1 and
>> +vice-versa.
>> +
>> +High level implementation
>> +-------------------------
>> +
>> +In order to update the value of the virtual MSR, a QEMU thread is created.
>> +The thread is basically just an infinite loop that does:
>> +
>> +1. Snapshot of the time metrics of all QEMU threads (Time spent scheduled in
>> +   Userspace and System)
>> +
>> +2. Snapshot of the actual MSR_PKG_ENERGY_STATUS counter of all packages where
>> +   the QEMU threads are running on.
>> +
>> +3. Sleep for 1 second - During this pause the vcpu and other non-vcpu threads
>> +   will do what they have to do and so the energy counter will increase.
>> +
>> +4. Repeat 2. and 3. and calculate the delta of every metric representing the
>> +   time spent scheduled for each QEMU thread *and* the energy spent by the
>> +   packages during the pause.
>> +
>> +5. Filter the vcpu threads and the non-vcpu threads.
>> +
>> +6. Retrieve the topology of the Virtual Machine. This helps identify which
>> +   vCPU is running on which virtual package.
>> +
>> +7. The total energy spent by the non-vcpu threads is divided by the number
>> +   of vcpu threads so that each vcpu thread will get an equal part of the
>> +   energy spent by the QEMU workers.
>> +
>> +8. Calculate the ratio of energy spent per vcpu threads.
>> +
>> +9. Calculate the energy for each virtual package.
>> +
>> +10. The virtual MSRs are updated for each virtual package. Each vCPU that
>> +    belongs to the same package will return the same value when accessing
>> +    the MSR.
>> +
>> +11. Loop back to 1.
>> +
>> +Ratio calculation
>> +-----------------
>> +
>> +In Linux, a process has an execution time associated with it. The scheduler is
>> +dividing the time in clock ticks. The number of clock ticks per second can be
>> +found by the sysconf system call. A typical value of clock ticks per second is
>> +100. So a core can run a process at the maximum of 100 ticks per second. If a
>> +package has 4 cores, 400 ticks maximum can be scheduled on all the cores
>> +of the package for a period of 1 second.
>> +
>> +The /proc/[pid]/stat [#b]_ is a procfs file that can give the executed time of a
>> +process with the [pid] as the process ID. It gives the amount of ticks the
>> +process has been scheduled in userspace (utime) and kernel space (stime).
>> +
>> +By reading those metrics for a thread, one can calculate the ratio of time the
>> +package has spent executing the thread.
>> +
>> +Example:
>> +
>> +A 4-core package can schedule a maximum of 400 ticks per second with 100 ticks
>> +per second per core. If a thread was scheduled for 100 ticks during one second
>> +on this package, that means the thread has been scheduled for 1/4 of the whole
>> +package. With that, the calculation of the energy spent by the thread on this
>> +package during this whole second is 1/4 of the total energy spent by the
>> +package.
>> +
>> +Usage
>> +-----
>> +
>> +Currently this feature is only working on an Intel CPU that has the RAPL driver
>> +mounted and available in the sysfs. If not, QEMU fails at start-up.
>> +
>> +This feature is activated with -accel
>> +kvm,rapl=true,rapl-helper-socket=/path/sock.sock
>> +
>> +It is important that the socket path is the same as the one
>> +:program:`qemu-vmsr-helper` is listening to.
>> +
>> +qemu-vmsr-helper
>> +----------------
>> +
>> +The qemu-vmsr-helper works very much like the qemu-pr-helper. Instead of
>> +making persistent reservations, qemu-vmsr-helper is here to overcome the
>> +restriction introduced for CVE-2020-8694, which removed user access to the
>> +RAPL MSR attributes.
>> +
>> +A socket communication is established between the QEMU processes that have
>> +RAPL MSR support activated and the qemu-vmsr-helper. A systemd service and
>> +socket activation are provided in
>> +contrib/systemd/qemu-vmsr-helper.(service/socket).
>> +
>> +The systemd socket uses mode 600, like contrib/systemd/qemu-pr-helper.socket.
>> +The socket can be passed via SCM_RIGHTS by libvirt, or its permissions can be
>> +changed (e.g. 660 and root:kvm on a Debian system). Libvirt could also start
>> +a separate helper if needed. All in all, the policy is left to the user.
>> +
>> +See the qemu-pr-helper documentation or manpage for further details.
>> +
>> +Current Limitations
>> +-------------------
>> +
>> +- Works only on Intel host CPUs because AMD CPUs are using different MSR
>> +  addresses.
>> +
>> +- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at the
>> +  moment.
>> +
>> +References
>> +----------
>> +
>> +.. [#a] https://patchwork.kernel.org/project/kvm/patch/20200916202951.23760-7-graf@amazon.com/
>> +.. [#b] https://man7.org/linux/man-pages/man5/proc.5.html
>> diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
>> index 3f3d13f81669..1d8fb1473bdf 100644
>> --- a/include/sysemu/kvm_int.h
>> +++ b/include/sysemu/kvm_int.h
>> @@ -14,6 +14,9 @@
>>  #include "qemu/accel.h"
>>  #include "qemu/queue.h"
>>  #include "sysemu/kvm.h"
>> +#include "hw/boards.h"
>> +#include "hw/i386/topology.h"
>> +#include "io/channel-socket.h"
>
>
> I'm skeptical about pulling x86-specific headers into a generic KVM header
> (it's a miracle that it builds at all), and by extension 'boards.h' as well.
>
> Can it be refactored so that you don't need to pull in 'boards.h'
> and avoid using x86-specific code in generic KVM code?
> (This also applies to KVMHostTopoInfo, added below.)
>
>
>>  typedef struct KVMSlot
>>  {
>> @@ -50,6 +53,34 @@ typedef struct KVMMemoryListener {
>>  
>>  #define KVM_MSI_HASHTAB_SIZE    256
>>  
>> +typedef struct KVMHostTopoInfo {
>> +    /* Number of packages on the Host */
>> +    unsigned int maxpkgs;
>> +    /* Number of cpus on the Host */
>> +    unsigned int maxcpus;
>> +    /* Number of cpus on each different package */
>> +    unsigned int *pkg_cpu_count;
>> +    /* Each package can have different maxticks */
>> +    unsigned int *maxticks;
>> +} KVMHostTopoInfo;
>> +
>> +struct KVMMsrEnergy {
>> +    pid_t pid;
>> +    bool enable;
>> +    char *socket_path;
>> +    QIOChannelSocket *sioc;
>> +    QemuThread msr_thr;
>> +    unsigned int guest_vcpus;
>> +    unsigned int guest_vsockets;
>> +    X86CPUTopoInfo guest_topo_info;
>> +    KVMHostTopoInfo host_topo;
>> +    const CPUArchIdList *guest_cpu_list;
>> +    uint64_t *msr_value;
>> +    uint64_t msr_unit;
>> +    uint64_t msr_limit;
>> +    uint64_t msr_info;
>> +};
>> +
>>  enum KVMDirtyRingReaperState {
>>      KVM_DIRTY_RING_REAPER_NONE = 0,
>>      /* The reaper is sleeping */
>> @@ -117,6 +148,7 @@ struct KVMState
>>      bool kvm_dirty_ring_with_bitmap;
>>      uint64_t kvm_eager_split_size;  /* Eager Page Splitting chunk size */
>>      struct KVMDirtyRingReaper reaper;
>> +    struct KVMMsrEnergy msr_energy;
>>      NotifyVmexitOption notify_vmexit;
>>      uint32_t notify_window;
>>      uint32_t xen_version;
>> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
>> index ccccb62fc353..c3891c1a6b4e 100644
>> --- a/target/i386/cpu.h
>> +++ b/target/i386/cpu.h
>> @@ -397,6 +397,10 @@ typedef enum X86Seg {
>>  #define MSR_IA32_TSX_CTRL		0x122
>>  #define MSR_IA32_TSCDEADLINE            0x6e0
>>  #define MSR_IA32_PKRS                   0x6e1
>> +#define MSR_RAPL_POWER_UNIT             0x00000606
>> +#define MSR_PKG_POWER_LIMIT             0x00000610
>> +#define MSR_PKG_ENERGY_STATUS           0x00000611
>> +#define MSR_PKG_POWER_INFO              0x00000614
>>  #define MSR_ARCH_LBR_CTL                0x000014ce
>>  #define MSR_ARCH_LBR_DEPTH              0x000014cf
>>  #define MSR_ARCH_LBR_FROM_0             0x00001500
>> @@ -1790,6 +1794,10 @@ typedef struct CPUArchState {
>>  
>>      uintptr_t retaddr;
>>  
>> +    /* RAPL MSR */
>> +    uint64_t msr_rapl_power_unit;
>> +    uint64_t msr_pkg_energy_status;
>> +
>>      /* Fields up to this point are cleared by a CPU reset */
>>      struct {} end_reset_fields;
>>  
>> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
>> index c5943605ee3a..8767c8e06028 100644
>> --- a/target/i386/kvm/kvm.c
>> +++ b/target/i386/kvm/kvm.c
>
> I'd also suggest moving most of the RAPL-related code added here to vmsr_energy.c,
> and leaving only the MSR access plumbing here.
> (this file is already huge, and adding 400+ LOC isn't helping its readability at all)
>
>> @@ -16,9 +16,12 @@
>>  #include "qapi/qapi-events-run-state.h"
>>  #include "qapi/error.h"
>>  #include "qapi/visitor.h"
>> +#include <math.h>
>>  #include <sys/ioctl.h>
>>  #include <sys/utsname.h>
>>  #include <sys/syscall.h>
>> +#include <sys/resource.h>
>> +#include <sys/time.h>
>>  
>>  #include <linux/kvm.h>
>>  #include "standard-headers/asm-x86/kvm_para.h"
>> @@ -26,6 +29,7 @@
>>  
>>  #include "cpu.h"
>>  #include "host-cpu.h"
>> +#include "vmsr_energy.h"
>>  #include "sysemu/sysemu.h"
>>  #include "sysemu/hw_accel.h"
>>  #include "sysemu/kvm_int.h"
>> @@ -2519,7 +2523,8 @@ static int kvm_get_supported_msrs(KVMState *s)
>>      return ret;
>>  }
>>  
>> -static bool kvm_rdmsr_core_thread_count(X86CPU *cpu, uint32_t msr,
>> +static bool kvm_rdmsr_core_thread_count(X86CPU *cpu,
>> +                                        uint32_t msr,
>>                                          uint64_t *val)
>>  {
>>      CPUState *cs = CPU(cpu);
>> @@ -2530,6 +2535,53 @@ static bool kvm_rdmsr_core_thread_count(X86CPU *cpu, uint32_t msr,
>>      return true;
>>  }
>>  
>> +static bool kvm_rdmsr_rapl_power_unit(X86CPU *cpu,
>> +                                      uint32_t msr,
>> +                                      uint64_t *val)
>> +{
>> +
>> +    CPUState *cs = CPU(cpu);
>> +
>> +    *val = cs->kvm_state->msr_energy.msr_unit;
>> +
>> +    return true;
>> +}
>> +
>> +static bool kvm_rdmsr_pkg_power_limit(X86CPU *cpu,
>> +                                      uint32_t msr,
>> +                                      uint64_t *val)
>> +{
>> +
>> +    CPUState *cs = CPU(cpu);
>> +
>> +    *val = cs->kvm_state->msr_energy.msr_limit;
>> +
>> +    return true;
>> +}
>> +
>> +static bool kvm_rdmsr_pkg_power_info(X86CPU *cpu,
>> +                                     uint32_t msr,
>> +                                     uint64_t *val)
>> +{
>> +
>> +    CPUState *cs = CPU(cpu);
>> +
>> +    *val = cs->kvm_state->msr_energy.msr_info;
>> +
>> +    return true;
>> +}
>> +
>> +static bool kvm_rdmsr_pkg_energy_status(X86CPU *cpu,
>> +                                        uint32_t msr,
>> +                                        uint64_t *val)
>> +{
>> +
>> +    CPUState *cs = CPU(cpu);
>> +    *val = cs->kvm_state->msr_energy.msr_value[cs->cpu_index];
>> +
>> +    return true;
>> +}
>> +
>>  static Notifier smram_machine_done;
>>  static KVMMemoryListener smram_listener;
>>  static AddressSpace smram_address_space;
>> @@ -2564,6 +2616,340 @@ static void register_smram_listener(Notifier *n, void *unused)
>>                                   &smram_address_space, 1, "kvm-smram");
>>  }
>>  
>> +static void *kvm_msr_energy_thread(void *data)
>> +{
>> +    KVMState *s = data;
>> +    struct KVMMsrEnergy *vmsr = &s->msr_energy;
>> +
>> +    g_autofree vmsr_package_energy_stat *pkg_stat = NULL;
>> +    g_autofree vmsr_thread_stat *thd_stat = NULL;
>> +    CPUState *cpu = NULL;
>> +    g_autofree unsigned int *vpkgs_energy_stat = NULL;
>> +    unsigned int num_threads = 0;
>> +
>> +    X86CPUTopoIDs topo_ids;
>> +
>> +    rcu_register_thread();
>> +
>> +    /* Allocate memory for each package energy status */
>> +    pkg_stat = g_new0(vmsr_package_energy_stat, vmsr->host_topo.maxpkgs);
>> +
>> +    /* Allocate memory for thread stats */
>> +    thd_stat = g_new0(vmsr_thread_stat, 1);
>> +
>> +    /* Allocate memory for holding virtual package energy counter */
>> +    vpkgs_energy_stat = g_new0(unsigned int, vmsr->guest_vsockets);
>> +
>> +    /* Populate the max ticks of each package */
>> +    for (int i = 0; i < vmsr->host_topo.maxpkgs; i++) {
>> +        /*
>> +         * Max numbers of ticks per package
>> +         * Time in second * Number of ticks/second * Number of cores/package
>> +         * ex: 100 ticks/second/CPU, 12 CPUs per Package gives 1200 ticks max
>> +         */
>> +        vmsr->host_topo.maxticks[i] = (MSR_ENERGY_THREAD_SLEEP_US / 1000000)
>> +                        * sysconf(_SC_CLK_TCK)
>> +                        * vmsr->host_topo.pkg_cpu_count[i];
>> +    }
>> +
>> +    while (true) {
>> +        /* Get all qemu threads id */
>> +        g_autofree pid_t *thread_ids =
>> +            vmsr_get_thread_ids(vmsr->pid, &num_threads);
>> +
>> +        if (thread_ids == NULL) {
>> +            goto clean;
>> +        }
>> +
>> +        thd_stat = g_renew(vmsr_thread_stat, thd_stat, num_threads);
>> +        /* Unlike g_new0, g_renew0 function doesn't exist yet... */
>> +        memset(thd_stat, 0, num_threads * sizeof(vmsr_thread_stat));
>> +
>> +        /* Populate all the thread stats */
>> +        for (int i = 0; i < num_threads; i++) {
>> +            thd_stat[i].utime = g_new0(unsigned long long, 2);
>> +            thd_stat[i].stime = g_new0(unsigned long long, 2);
>> +            thd_stat[i].thread_id = thread_ids[i];
>> +            vmsr_read_thread_stat(vmsr->pid,
>> +                                  thd_stat[i].thread_id,
>> +                                  thd_stat[i].utime,
>> +                                  thd_stat[i].stime,
>> +                                  &thd_stat[i].cpu_id);
>> +            thd_stat[i].pkg_id =
>> +                vmsr_get_physical_package_id(thd_stat[i].cpu_id);
>> +        }
>> +
>> +        /* Retrieve all packages power plane energy counter */
>> +        for (int i = 0; i < vmsr->host_topo.maxpkgs; i++) {
>> +            for (int j = 0; j < num_threads; j++) {
>> +                /*
>> +                 * Use the first thread we found that ran on the CPU
>> +                 * of the package to read the packages energy counter
>> +                 */
>> +                if (thd_stat[j].pkg_id == i) {
>> +                    pkg_stat[i].e_start =
>> +                    vmsr_read_msr(MSR_PKG_ENERGY_STATUS,
>> +                                  thd_stat[j].cpu_id,
>> +                                  thd_stat[j].thread_id,
>> +                                  s->msr_energy.sioc);
>> +                    break;
>> +                }
>> +            }
>> +        }
>> +
>> +        /* Sleep a short period while the other threads are working */
>> +        usleep(MSR_ENERGY_THREAD_SLEEP_US);
>> +
>> +        /*
>> +         * Retrieve all packages power plane energy counter
>> +         * Calculate the delta of all packages
>> +         */
>> +        for (int i = 0; i < vmsr->host_topo.maxpkgs; i++) {
>> +            for (int j = 0; j < num_threads; j++) {
>> +                /*
>> +                 * Use the first thread we found that ran on the CPU
>> +                 * of the package to read the packages energy counter
>> +                 */
>> +                if (thd_stat[j].pkg_id == i) {
>> +                    pkg_stat[i].e_end =
>> +                    vmsr_read_msr(MSR_PKG_ENERGY_STATUS,
>> +                                  thd_stat[j].cpu_id,
>> +                                  thd_stat[j].thread_id,
>> +                                  s->msr_energy.sioc);
>> +                    /*
>> +                     * Handle the case where the VM was migrated
>> +                     * during the sleep period, or any other case
>> +                     * where the energy counter might be lower after
>> +                     * the sleep period.
>> +                     */
>> +                    if (pkg_stat[i].e_end > pkg_stat[i].e_start) {
>> +                        pkg_stat[i].e_delta =
>> +                            pkg_stat[i].e_end - pkg_stat[i].e_start;
>> +                    } else {
>> +                        pkg_stat[i].e_delta = 0;
>> +                    }
>> +                    break;
>> +                }
>> +            }
>> +        }
>> +
>> +        /* Delta of ticks spent by each thread between the samples */
>> +        for (int i = 0; i < num_threads; i++) {
>> +            vmsr_read_thread_stat(vmsr->pid,
>> +                                  thd_stat[i].thread_id,
>> +                                  thd_stat[i].utime,
>> +                                  thd_stat[i].stime,
>> +                                  &thd_stat[i].cpu_id);
>> +
>> +            if (vmsr->pid < 0) {
>> +                /*
>> +                 * We don't count the dead thread
>> +                 * i.e threads that existed before the sleep
>> +                 * and not anymore
>> +                 */
>> +                thd_stat[i].delta_ticks = 0;
>> +            } else {
>> +                vmsr_delta_ticks(thd_stat, i);
>> +            }
>> +        }
>> +
>> +        /*
>> +         * Identify the vcpu threads
>> +         * Calculate the number of vcpu per package
>> +         */
>> +        CPU_FOREACH(cpu) {
>> +            for (int i = 0; i < num_threads; i++) {
>> +                if (cpu->thread_id == thd_stat[i].thread_id) {
>> +                    thd_stat[i].is_vcpu = true;
>> +                    thd_stat[i].vcpu_id = cpu->cpu_index;
>> +                    pkg_stat[thd_stat[i].pkg_id].nb_vcpu++;
>> +                    thd_stat[i].acpi_id = kvm_arch_vcpu_id(cpu);
>> +                    break;
>> +                }
>> +            }
>> +        }
>> +
>> +        /* Retrieve the virtual package number of each vCPU */
>> +        for (int i = 0; i < vmsr->guest_cpu_list->len; i++) {
>> +            for (int j = 0; j < num_threads; j++) {
>> +                if ((thd_stat[j].acpi_id ==
>> +                        vmsr->guest_cpu_list->cpus[i].arch_id)
>> +                    && (thd_stat[j].is_vcpu == true)) {
>> +                    x86_topo_ids_from_apicid(thd_stat[j].acpi_id,
>> +                        &vmsr->guest_topo_info, &topo_ids);
>> +                    thd_stat[j].vpkg_id = topo_ids.pkg_id;
>> +                }
>> +            }
>> +        }
>> +
>> +        /* Calculate the total energy of all non-vCPU thread */
>> +        for (int i = 0; i < num_threads; i++) {
>> +            if ((thd_stat[i].is_vcpu != true) &&
>> +                (thd_stat[i].delta_ticks > 0)) {
>> +                double temp;
>> +                temp = vmsr_get_ratio(pkg_stat[thd_stat[i].pkg_id].e_delta,
>> +                    thd_stat[i].delta_ticks,
>> +                    vmsr->host_topo.maxticks[thd_stat[i].pkg_id]);
>> +                pkg_stat[thd_stat[i].pkg_id].e_ratio
>> +                    += (uint64_t)lround(temp);
>> +            }
>> +        }
>> +
>> +        /* Calculate the ratio per non-vCPU thread of each package */
>> +        for (int i = 0; i < vmsr->host_topo.maxpkgs; i++) {
>> +            if (pkg_stat[i].nb_vcpu > 0) {
>> +                pkg_stat[i].e_ratio = pkg_stat[i].e_ratio / pkg_stat[i].nb_vcpu;
>> +            }
>> +        }
>> +
>> +        /*
>> +         * Calculate the energy for each Package:
>> +         * Energy Package = sum of each vCPU energy that belongs to the package
>> +         */
>> +        for (int i = 0; i < num_threads; i++) {
>> +            if ((thd_stat[i].is_vcpu == true) &&
>> +                    (thd_stat[i].delta_ticks > 0)) {
>> +                double temp;
>> +                temp = vmsr_get_ratio(pkg_stat[thd_stat[i].pkg_id].e_delta,
>> +                    thd_stat[i].delta_ticks,
>> +                    vmsr->host_topo.maxticks[thd_stat[i].pkg_id]);
>> +                vpkgs_energy_stat[thd_stat[i].vpkg_id] +=
>> +                    (uint64_t)lround(temp);
>> +                vpkgs_energy_stat[thd_stat[i].vpkg_id] +=
>> +                    pkg_stat[thd_stat[i].pkg_id].e_ratio;
>> +            }
>> +        }
>> +
>> +        /*
>> +         * Finally populate the vmsr register of each vCPU with the total
>> +         * package value to emulate the real hardware where each CPU returns
>> +         * the value of the package it belongs to.
>> +         */
>> +        for (int i = 0; i < num_threads; i++) {
>> +            if ((thd_stat[i].is_vcpu == true) &&
>> +                    (thd_stat[i].delta_ticks > 0)) {
>> +                vmsr->msr_value[thd_stat[i].vcpu_id] =
>> +                                        vpkgs_energy_stat[thd_stat[i].vpkg_id];
>> +            }
>> +        }
>> +
>> +        /* Freeing memory before zeroing the pointer */
>> +        for (int i = 0; i < num_threads; i++) {
>> +            g_free(thd_stat[i].utime);
>> +            g_free(thd_stat[i].stime);
>> +        }
>> +    }
>> +
>> +clean:
>> +    rcu_unregister_thread();
>> +    return NULL;
>> +}
>> +
>> +static int kvm_msr_energy_thread_init(KVMState *s, MachineState *ms)
>> +{
>> +    MachineClass *mc = MACHINE_GET_CLASS(ms);
>> +    struct KVMMsrEnergy *r = &s->msr_energy;
>> +    int ret = 0;
>> +
>> +    /*
>> +     * Sanity check
>> +     * 1. Host cpu must be Intel cpu
>> +     * 2. RAPL must be enabled on the Host
>> +     */
>> +    if (is_host_cpu_intel()) {
>> +        error_report("The RAPL feature can only be enabled on hosts "
>> +                     "with Intel CPU models");
>> +        ret = 1;
>> +        goto out;
>> +    }
>> +
>> +    if (!is_rapl_enabled()) {
>> +        ret = 1;
>> +        goto out;
>> +    }
>> +
>> +    /* Retrieve the virtual topology */
>> +    vmsr_init_topo_info(&r->guest_topo_info, ms);
>> +
>> +    /* Retrieve the number of vcpu */
>> +    r->guest_vcpus = ms->smp.cpus;
>> +
>> +    /* Retrieve the number of virtual sockets */
>> +    r->guest_vsockets = ms->smp.sockets;
>> +
>> +    /* Allocate register memory (MSR_PKG_ENERGY_STATUS) for each vcpu */
>> +    r->msr_value = g_new0(uint64_t, r->guest_vcpus);
>> +
>> +    /* Retrieve the CPUArchIDlist */
>> +    r->guest_cpu_list = mc->possible_cpu_arch_ids(ms);
>> +
>> +    /* Max number of cpus on the Host */
>> +    r->host_topo.maxcpus = vmsr_get_maxcpus();
>> +    if (r->host_topo.maxcpus == 0) {
>> +        error_report("host max cpus = 0");
>> +        ret = 1;
>> +        goto out;
>> +    }
>> +
>> +    /* Max number of packages on the host */
>> +    r->host_topo.maxpkgs = vmsr_get_max_physical_package(r->host_topo.maxcpus);
>> +    if (r->host_topo.maxpkgs == 0) {
>> +        error_report("host max pkgs = 0");
>> +        ret = 1;
>> +        goto out;
>> +    }
>> +
>> +    /* Allocate memory for each package on the host */
>> +    r->host_topo.pkg_cpu_count = g_new0(unsigned int, r->host_topo.maxpkgs);
>> +    r->host_topo.maxticks = g_new0(unsigned int, r->host_topo.maxpkgs);
>> +
>> +    vmsr_count_cpus_per_package(r->host_topo.pkg_cpu_count,
>> +                                r->host_topo.maxpkgs);
>> +    for (int i = 0; i < r->host_topo.maxpkgs; i++) {
>> +        if (r->host_topo.pkg_cpu_count[i] == 0) {
>> +            error_report("cpu per packages = 0 on package_%d", i);
>> +            ret = 1;
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    /* Get QEMU PID */
>> +    r->pid = getpid();
>> +
>> +    /* Compute the socket path if necessary */
>> +    if (s->msr_energy.socket_path == NULL) {
>> +        s->msr_energy.socket_path = vmsr_compute_default_paths();
>> +    }
>> +
>> +    /* Open socket with vmsr helper */
>> +    s->msr_energy.sioc = vmsr_open_socket(s->msr_energy.socket_path);
>> +
>> +    if (s->msr_energy.sioc == NULL) {
>> +        error_report("vmsr socket opening failed");
>> +        ret = 1;
>> +        goto out;
>> +    }
>> +
>> +    /* Those MSR values should not change */
>> +    r->msr_unit  = vmsr_read_msr(MSR_RAPL_POWER_UNIT, 0, r->pid,
>> +                                    s->msr_energy.sioc);
>> +    r->msr_limit = vmsr_read_msr(MSR_PKG_POWER_LIMIT, 0, r->pid,
>> +                                    s->msr_energy.sioc);
>> +    r->msr_info  = vmsr_read_msr(MSR_PKG_POWER_INFO, 0, r->pid,
>> +                                    s->msr_energy.sioc);
>> +    if (r->msr_unit == 0 || r->msr_limit == 0 || r->msr_info == 0) {
>> +        error_report("can't read any virtual msr");
>> +        ret = 1;
>> +        goto out;
>> +    }
>> +
>> +    qemu_thread_create(&r->msr_thr, "kvm-msr",
>> +                       kvm_msr_energy_thread,
>> +                       s, QEMU_THREAD_JOINABLE);
>> +out:
>> +    return ret;
>> +}
>> +
>>  int kvm_arch_get_default_type(MachineState *ms)
>>  {
>>      return 0;
>> @@ -2768,6 +3154,49 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>>                           strerror(-ret));
>>              exit(1);
>>          }
>> +
>> +        if (s->msr_energy.enable == true) {
>> +            r = kvm_filter_msr(s, MSR_RAPL_POWER_UNIT,
>> +                               kvm_rdmsr_rapl_power_unit, NULL);
>> +            if (!r) {
>> +                error_report("Could not install MSR_RAPL_POWER_UNIT "
>> +                             "handler: %s",
>> +                             strerror(-ret));
>> +                exit(1);
>> +            }
>> +
>> +            r = kvm_filter_msr(s, MSR_PKG_POWER_LIMIT,
>> +                               kvm_rdmsr_pkg_power_limit, NULL);
>> +            if (!r) {
>> +                error_report("Could not install MSR_PKG_POWER_LIMIT "
>> +                             "handler: %s",
>> +                             strerror(-ret));
>> +                exit(1);
>> +            }
>> +
>> +            r = kvm_filter_msr(s, MSR_PKG_POWER_INFO,
>> +                               kvm_rdmsr_pkg_power_info, NULL);
>> +            if (!r) {
>> +                error_report("Could not install MSR_PKG_POWER_INFO "
>> +                             "handler: %s",
>> +                             strerror(-ret));
>> +                exit(1);
>> +            }
>> +            r = kvm_filter_msr(s, MSR_PKG_ENERGY_STATUS,
>> +                               kvm_rdmsr_pkg_energy_status, NULL);
>> +            if (!r) {
>> +                error_report("Could not install MSR_PKG_ENERGY_STATUS "
>> +                             "handler: %s",
>> +                             strerror(-ret));
>> +                exit(1);
>> +            }
>> +            r = kvm_msr_energy_thread_init(s, ms);
>> +            if (r) {
>> +                error_report("kvm : error RAPL feature requirement not met");
>> +                exit(1);
>> +            }
>> +
>> +        }
>>      }
>>  
>>      return 0;
>> diff --git a/target/i386/kvm/meson.build b/target/i386/kvm/meson.build
>> index e7850981e62d..3996cafaf29f 100644
>> --- a/target/i386/kvm/meson.build
>> +++ b/target/i386/kvm/meson.build
>> @@ -3,6 +3,7 @@ i386_kvm_ss = ss.source_set()
>>  i386_kvm_ss.add(files(
>>    'kvm.c',
>>    'kvm-cpu.c',
>> +  'vmsr_energy.c',
>>  ))
>>  
>>  i386_kvm_ss.add(when: 'CONFIG_XEN_EMU', if_true: files('xen-emu.c'))
>> diff --git a/target/i386/kvm/vmsr_energy.c b/target/i386/kvm/vmsr_energy.c
>> new file mode 100644
>> index 000000000000..acf0fc0a2fb3
>> --- /dev/null
>> +++ b/target/i386/kvm/vmsr_energy.c
>> @@ -0,0 +1,344 @@
>> +/*
>> + * QEMU KVM support -- x86 virtual RAPL msr
>> + *
>> + * Copyright 2024 Red Hat, Inc.
>> + *
>> + *  Author:
>> + *      Anthony Harivel <aharivel@redhat.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + *
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/error-report.h"
>> +#include "vmsr_energy.h"
>> +#include "io/channel.h"
>> +#include "io/channel-socket.h"
>> +#include "hw/boards.h"
>> +#include "cpu.h"
>> +#include "host-cpu.h"
>> +
>> +char *vmsr_compute_default_paths(void)
>> +{
>> +    g_autofree char *state = qemu_get_local_state_dir();
>> +
>> +    return g_build_filename(state, "run", "qemu-vmsr-helper.sock", NULL);
>> +}
>> +
>> +/* Note: returns non-zero when the host vendor is NOT Intel (strcmp result) */
>> +bool is_host_cpu_intel(void)
>> +{
>> +    int family, model, stepping;
>> +    char vendor[CPUID_VENDOR_SZ + 1];
>> +
>> +    host_cpu_vendor_fms(vendor, &family, &model, &stepping);
>> +
>> +    return strcmp(vendor, CPUID_VENDOR_INTEL);
>> +}
>> +
>> +int is_rapl_enabled(void)
>> +{
>> +    const char *path = "/sys/class/powercap/intel-rapl/enabled";
>> +    FILE *file = fopen(path, "r");
>> +    int value = 0;
>> +
>> +    if (file != NULL) {
>> +        if (fscanf(file, "%d", &value) != 1) {
>> +            error_report("INTEL RAPL not enabled");
>> +        }
>> +        fclose(file);
>> +    } else {
>> +        error_report("Error opening %s", path);
>> +    }
>> +
>> +    return value;
>> +}
>> +
>> +QIOChannelSocket *vmsr_open_socket(const char *path)
>> +{
>> +    g_autofree char *socket_path = NULL;
>> +
>> +    socket_path = g_strdup(path);
>> +
>> +    SocketAddress saddr = {
>> +        .type = SOCKET_ADDRESS_TYPE_UNIX,
>> +        .u.q_unix.path = socket_path
>> +    };
>> +
>> +    QIOChannelSocket *sioc = qio_channel_socket_new();
>> +    Error *local_err = NULL;
>> +
>> +    qio_channel_set_name(QIO_CHANNEL(sioc), "vmsr-helper");
>> +    qio_channel_socket_connect_sync(sioc,
>> +                                    &saddr,
>> +                                    &local_err);
>> +    if (local_err) {
>> +        /* Close socket. */
>> +        qio_channel_close(QIO_CHANNEL(sioc), NULL);
>> +        object_unref(OBJECT(sioc));
>> +        sioc = NULL;
>> +        goto out;
>> +    }
>> +
>> +    qio_channel_set_delay(QIO_CHANNEL(sioc), false);
>> +out:
>> +    return sioc;
>> +}
>> +
>> +uint64_t vmsr_read_msr(uint32_t reg, uint32_t cpu_id, uint32_t tid,
>> +                       QIOChannelSocket *sioc)
>> +{
>> +    uint64_t data = 0;
>> +    int r = 0;
>> +    Error *local_err = NULL;
>> +    uint32_t buffer[3];
>> +    /*
>> +     * Send the required arguments:
>> +     * 1. RAPL MSR register to read
>> +     * 2. On which CPU ID
>> +     * 3. From which vCPU (Thread ID)
>> +     */
>> +    buffer[0] = reg;
>> +    buffer[1] = cpu_id;
>> +    buffer[2] = tid;
>> +
>> +    r = qio_channel_write_all(QIO_CHANNEL(sioc),
>> +                              (char *)buffer, sizeof(buffer),
>> +                              &local_err);
>> +    if (r < 0) {
>> +        goto out_close;
>> +    }
>> +
>> +    r = qio_channel_read_all(QIO_CHANNEL(sioc),
>> +                             (char *)&data, sizeof(data),
>> +                             &local_err);
>> +    if (r < 0) {
>> +        data = 0;
>> +        goto out_close;
>> +    }
>> +
>> +out_close:
>> +    return data;
>> +}
>> +
>> +/* Retrieve the max number of physical packages */
>> +unsigned int vmsr_get_max_physical_package(unsigned int max_cpus)
>> +{
>> +    const char *dir = "/sys/devices/system/cpu/";
>> +    const char *topo_path = "topology/physical_package_id";
>> +    g_autofree int *uniquePackages = g_new0(int, max_cpus);
>> +    unsigned int packageCount = 0;
>> +    FILE *file = NULL;
>> +
>> +    for (int i = 0; i < max_cpus; i++) {
>> +        g_autofree char *filePath = NULL;
>> +        g_autofree char *cpuid = g_strdup_printf("cpu%d", i);
>> +
>> +        filePath = g_build_filename(dir, cpuid, topo_path, NULL);
>> +
>> +        file = fopen(filePath, "r");
>> +
>> +        if (file == NULL) {
>> +            error_report("Error opening physical_package_id file");
>> +            return 0;
>> +        }
>> +
>> +        char packageId[10];
>> +        if (fgets(packageId, sizeof(packageId), file) == NULL) {
>> +            packageCount = 0;
>> +        }
>> +
>> +        fclose(file);
>> +
>> +        int currentPackageId = atoi(packageId);
>> +
>> +        bool isUnique = true;
>> +        for (int j = 0; j < packageCount; j++) {
>> +            if (uniquePackages[j] == currentPackageId) {
>> +                isUnique = false;
>> +                break;
>> +            }
>> +        }
>> +
>> +        if (isUnique) {
>> +            uniquePackages[packageCount] = currentPackageId;
>> +            packageCount++;
>> +
>> +            if (packageCount >= max_cpus) {
>> +                break;
>> +            }
>> +        }
>> +    }
>> +
>> +    return (packageCount == 0) ? 1 : packageCount;
>> +}
>> +
>> +/* Retrieve the max number of physical cpu on the host */
>> +unsigned int vmsr_get_maxcpus(void)
>> +{
>> +    GDir *dir;
>> +    const gchar *entry_name;
>> +    unsigned int cpu_count = 0;
>> +    const char *path = "/sys/devices/system/cpu/";
>> +
>> +    dir = g_dir_open(path, 0, NULL);
>> +    if (dir == NULL) {
>> +        error_report("Unable to open cpu directory");
>> +        return 0; /* unsigned return type: the caller treats 0 as failure */
>> +    }
>> +
>> +    while ((entry_name = g_dir_read_name(dir)) != NULL) {
>> +        if (g_ascii_strncasecmp(entry_name, "cpu", 3) == 0 &&
>> +            isdigit(entry_name[3])) {
>> +            cpu_count++;
>> +        }
>> +    }
>> +
>> +    g_dir_close(dir);
>> +
>> +    return cpu_count;
>> +}
>> +
>> +/* Count the number of physical cpus on each package */
>> +unsigned int vmsr_count_cpus_per_package(unsigned int *package_count,
>> +                                         unsigned int max_pkgs)
>> +{
>> +    gsize length;
>> +
>> +    /* Iterate over cpus and count cpus in each package */
>> +    for (int cpu_id = 0; ; cpu_id++) {
>> +        /* Scoped to the loop body so each iteration frees its buffers */
>> +        g_autofree char *file_contents = NULL;
>> +        g_autofree char *path = NULL;
>> +
>> +        path = g_strdup_printf("/sys/devices/system/cpu/cpu%d/"
>> +            "topology/physical_package_id", cpu_id);
>> +
>> +        if (!g_file_get_contents(path, &file_contents, &length, NULL)) {
>> +            break; /* No more cpus */
>> +        }
>> +
>> +        /* Get the physical package ID for this CPU */
>> +        int package_id = atoi(file_contents);
>> +
>> +        /* Check if the package ID is within the known number of packages */
>> +        if (package_id >= 0 && package_id < max_pkgs) {
>> +            /* If yes, count the cpu for this package*/
>> +            package_count[package_id]++;
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/* Get the physical package id from a given cpu id */
>> +int vmsr_get_physical_package_id(int cpu_id)
>> +{
>> +    g_autofree char *file_contents = NULL;
>> +    g_autofree char *file_path = NULL;
>> +    int package_id = -1;
>> +    gsize length;
>> +
>> +    file_path = g_strdup_printf("/sys/devices/system/cpu/cpu%d"
>> +        "/topology/physical_package_id", cpu_id);
>> +
>> +    if (!g_file_get_contents(file_path, &file_contents, &length, NULL)) {
>> +        goto out;
>> +    }
>> +
>> +    package_id = atoi(file_contents);
>> +
>> +out:
>> +    return package_id;
>> +}
>> +
>> +/* Read the scheduled time for a given thread of a given pid */
>> +void vmsr_read_thread_stat(pid_t pid,
>> +                      unsigned int thread_id,
>> +                      unsigned long long *utime,
>> +                      unsigned long long *stime,
>> +                      unsigned int *cpu_id)
>> +{
>> +    g_autofree char *path = NULL;
>> +    g_autofree char *path_name = NULL;
>> +
>> +    path_name = g_strdup_printf("/proc/%u/task/%d/stat", pid, thread_id);
>> +
>> +    path = g_build_filename(path_name, NULL);
>> +
>> +    FILE *file = fopen(path, "r");
>> +    if (file == NULL) {
>> +        pid = -1;
>> +        return;
>> +    }
>> +
>> +    if (fscanf(file, "%*d (%*[^)]) %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u"
>> +        " %llu %llu %*d %*d %*d %*d %*d %*d %*u %*u %*d %*u %*u"
>> +        " %*u %*u %*u %*u %*u %*u %*u %*u %*u %*d %*u %*u %u",
>> +           utime, stime, cpu_id) != 3)
>> +    {
>> +        fclose(file);
>> +        return;
>> +    }
>> +
>> +    fclose(file);
>> +    return;
>> +}
>> +
>> +/* Read QEMU stat task folder to retrieve all QEMU threads ID */
>> +pid_t *vmsr_get_thread_ids(pid_t pid, unsigned int *num_threads)
>> +{
>> +    g_autofree char *task_path = g_strdup_printf("%d/task", pid);
>> +    g_autofree char *path = g_build_filename("/proc", task_path, NULL);
>> +
>> +    DIR *dir = opendir(path);
>> +    if (dir == NULL) {
>> +        error_report("Error opening /proc/qemu/task");
>> +        return NULL;
>> +    }
>> +
>> +    pid_t *thread_ids = NULL;
>> +    unsigned int thread_count = 0;
>> +
>> +    struct dirent *ent = NULL;
>> +    while ((ent = readdir(dir)) != NULL) {
>> +        if (ent->d_name[0] == '.') {
>> +            continue;
>> +        }
>> +        pid_t tid = atoi(ent->d_name);
>> +        if (pid != tid) {
>> +            thread_ids = g_renew(pid_t, thread_ids, (thread_count + 1));
>> +            thread_ids[thread_count] = tid;
>> +            thread_count++;
>> +        }
>> +    }
>> +
>> +    closedir(dir);
>> +
>> +    *num_threads = thread_count;
>> +    return thread_ids;
>> +}
>> +
>> +void vmsr_delta_ticks(vmsr_thread_stat *thd_stat, int i)
>> +{
>> +    thd_stat[i].delta_ticks = (thd_stat[i].utime[1] + thd_stat[i].stime[1])
>> +                            - (thd_stat[i].utime[0] + thd_stat[i].stime[0]);
>> +}
>> +
>> +double vmsr_get_ratio(uint64_t e_delta,
>> +                      unsigned long long delta_ticks,
>> +                      unsigned int maxticks)
>> +{
>> +    return (e_delta / 100.0) * ((100.0 / maxticks) * delta_ticks);
>> +}
>> +
>> +void vmsr_init_topo_info(X86CPUTopoInfo *topo_info,
>> +                           const MachineState *ms)
>> +{
>> +    topo_info->dies_per_pkg = ms->smp.dies;
>> +    topo_info->cores_per_die = ms->smp.cores;
>> +    topo_info->threads_per_core = ms->smp.threads;
>> +}
>> +
>> diff --git a/target/i386/kvm/vmsr_energy.h b/target/i386/kvm/vmsr_energy.h
>> new file mode 100644
>> index 000000000000..16cc1f4814f6
>> --- /dev/null
>> +++ b/target/i386/kvm/vmsr_energy.h
>> @@ -0,0 +1,99 @@
>> +/*
>> + * QEMU KVM support -- x86 virtual energy-related MSR.
>> + *
>> + * Copyright 2024 Red Hat, Inc. 2024
>> + *
>> + *  Author:
>> + *      Anthony Harivel <aharivel@redhat.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + *
>> + */
>> +
>> +#ifndef VMSR_ENERGY_H
>> +#define VMSR_ENERGY_H
>> +
>> +#include <stdint.h>
>> +#include "qemu/osdep.h"
>> +#include "io/channel-socket.h"
>> +#include "hw/i386/topology.h"
>> +
>> +/*
>> + * Define the interval time in microseconds between 2 samples of
>> + * energy related MSRs
>> + */
>> +#define MSR_ENERGY_THREAD_SLEEP_US 1000000.0
>> +
>> +/*
>> + * Thread statistic
>> + * @ thread_id: TID (thread ID)
>> + * @ is_vcpu: true if TID is vCPU thread
>> + * @ cpu_id: CPU number last executed on
>> + * @ pkg_id: package number of the CPU
>> + * @ vcpu_id: vCPU ID
>> + * @ vpkg_id: virtual package number
>> + * @ acpi_id: APIC id of the vCPU
>> + * @ utime: amount of clock ticks the thread
>> + *          has been scheduled in User mode
>> + * @ stime: amount of clock ticks the thread
>> + *          has been scheduled in System mode
>> + * @ delta_ticks: delta of utime+stime between
>> + *          the two samples (before/after sleep)
>> + */
>> +struct vmsr_thread_stat {
>> +    unsigned int thread_id;
>> +    bool is_vcpu;
>> +    unsigned int cpu_id;
>> +    unsigned int pkg_id;
>> +    unsigned int vpkg_id;
>> +    unsigned int vcpu_id;
>> +    unsigned long acpi_id;
>> +    unsigned long long *utime;
>> +    unsigned long long *stime;
>> +    unsigned long long delta_ticks;
>> +};
>> +
>> +/*
>> + * Package statistic
>> + * @ e_start: package energy counter before the sleep
>> + * @ e_end: package energy counter after the sleep
>> + * @ e_delta: delta of package energy counter
>> + * @ e_ratio: store the energy ratio of non-vCPU thread
>> + * @ nb_vcpu: number of vCPU running on this package
>> + */
>> +struct vmsr_package_energy_stat {
>> +    uint64_t e_start;
>> +    uint64_t e_end;
>> +    uint64_t e_delta;
>> +    uint64_t e_ratio;
>> +    unsigned int nb_vcpu;
>> +};
>> +
>> +typedef struct vmsr_thread_stat vmsr_thread_stat;
>> +typedef struct vmsr_package_energy_stat vmsr_package_energy_stat;
>> +
>> +char *vmsr_compute_default_paths(void);
>> +void vmsr_read_thread_stat(pid_t pid,
>> +                      unsigned int thread_id,
>> +                      unsigned long long *utime,
>> +                      unsigned long long *stime,
>> +                      unsigned int *cpu_id);
>> +
>> +QIOChannelSocket *vmsr_open_socket(const char *path);
>> +uint64_t vmsr_read_msr(uint32_t reg, uint32_t cpu_id,
>> +                       uint32_t tid, QIOChannelSocket *sioc);
>> +void vmsr_delta_ticks(vmsr_thread_stat *thd_stat, int i);
>> +unsigned int vmsr_get_maxcpus(void);
>> +unsigned int vmsr_get_max_physical_package(unsigned int max_cpus);
>> +unsigned int vmsr_count_cpus_per_package(unsigned int *package_count,
>> +                                         unsigned int max_pkgs);
>> +int vmsr_get_physical_package_id(int cpu_id);
>> +pid_t *vmsr_get_thread_ids(pid_t pid, unsigned int *num_threads);
>> +double vmsr_get_ratio(uint64_t e_delta,
>> +                      unsigned long long delta_ticks,
>> +                      unsigned int maxticks);
>> +void vmsr_init_topo_info(X86CPUTopoInfo *topo_info, const MachineState *ms);
>> +bool is_host_cpu_intel(void);
>> +int is_rapl_enabled(void);
>> +#endif /* VMSR_ENERGY_H */




* Re: [PATCH v6 0/3] Add support for the RAPL MSRs series
  2024-10-16 12:56   ` Anthony Harivel
@ 2024-10-18 12:25     ` Igor Mammedov
  2024-10-18 12:59       ` Daniel P. Berrangé
  2024-10-22 13:49       ` Anthony Harivel
  0 siblings, 2 replies; 25+ messages in thread
From: Igor Mammedov @ 2024-10-18 12:25 UTC (permalink / raw)
  To: Anthony Harivel
  Cc: pbonzini, mtosatti, berrange, qemu-devel, vchundur, rjarry

On Wed, 16 Oct 2024 14:56:39 +0200
"Anthony Harivel" <aharivel@redhat.com> wrote:

> Hi Igor,
> 
> Igor Mammedov, Oct 16, 2024 at 13:52:
> > On Wed, 22 May 2024 17:34:49 +0200
> > Anthony Harivel <aharivel@redhat.com> wrote:
> >  
> >> Dear maintainers, 
> >> 
> >> First of all, thank you very much for your review of my patch 
> >> [1].  
> >
> > I've tried to play with this feature and have a few questions about it
> >  
> 
> Thanks for testing this new feature. 
> 
> >  1. trying to start with non accessible or not existent socket
> >         -accel kvm,rapl=on,rapl-helper-socket=/tmp/socket 
> >     I get:
> >       qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/socks: vmsr socket opening failed
> >       qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/socks: kvm : error RAPL feature requirement not met
> >     * is it possible to report actual OS error that happened during open/connect,
> >       instead of unhelpful 'socket opening failed'?
> >
> >       What I see in vmsr_open_socket() error is ignored
> >       and btw it's error leak as well
> >  
> 
> Shame you missed the 6 iterations of that patch that lasted for a year. 
> I would have changed that directly !
> Anyway I take note on that comment and will send a modification.
> 
> >     * 2nd line shouldn't be there if the 1st error already present.
> >
> >  2.  getting periodic error on console where QEMU has been started
> >       # ./qemu-vmsr-helper -k /tmp/sock
> >      ./qemu-system-x86_64 -snapshot -m 4G -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock rhel90.img  -vnc :0 -cpu host
> >      and let it run
> >
> >       it appears rdmsr works (well, it returns some values at least)
> >       however there are recurring errors in qemu's stderr(or out)
> >       
> >       qemu-system-x86_64: Error opening /proc/2496093/task/2496109/stat
> >       qemu-system-x86_64: Error opening /proc/2496093/task/2496095/stat
> >
> >       My guess it's some temporary threads, that come and go, but still
> >       they shouldn't cause errors if it's normal operation.
> >  
> 
> There is a WIP patch that changes this into a tracepoint. Maybe you can 
> SSH to the VM in the meantime?

it's just idling VM that doesn't do anything, hence the question.  

> 
> >       Also on daemon side, I a few times got while guest was running:
> >         qemu-vmsr-helper: Failed to open /proc at /proc/2496026/task/2496044
> >         qemu-vmsr-helper: Requested TID not in peer PID: 2496026 2496044
> >       though I can't reproduce it reliably  
> 
> This could happen only when a vCPU thread ID has changed between the 
> call of a rdmsr through the socket and the helper that reads the MSR.
> No idea how a vCPU can change TID or shutdown that fast.

I guess it needs to be figured out to decide if it's safe to ignore (and not print error)
or if it's a genuine error/bug somewhere
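
A minimal sketch of one way to tell the two cases apart, assuming a
thread that exits between readdir() and fopen() surfaces as ENOENT (or
as ESRCH on the later read). read_thread_stat_quiet() is a hypothetical
name, not part of the patch; the fscanf() format is copied from the
quoted vmsr_read_thread_stat().

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper, not part of the patch.
 * Returns 0 on success, 1 if the thread has already gone away (assumed
 * to be a benign race with thread exit), -1 on any other failure.
 */
static int read_thread_stat_quiet(pid_t pid, pid_t tid,
                                  unsigned long long *utime,
                                  unsigned long long *stime,
                                  unsigned int *cpu_id)
{
    char path[64];
    FILE *f;
    int n, saved_errno;

    assert(utime != NULL && stime != NULL && cpu_id != NULL);

    snprintf(path, sizeof(path), "/proc/%d/task/%d/stat", (int)pid, (int)tid);
    f = fopen(path, "r");
    if (f == NULL) {
        /* Thread vanished before we could open its stat file. */
        return (errno == ENOENT || errno == ESRCH) ? 1 : -1;
    }

    errno = 0;
    /* Fields 14, 15 and 39 of /proc/<pid>/task/<tid>/stat. */
    n = fscanf(f, "%*d (%*[^)]) %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u"
               " %llu %llu %*d %*d %*d %*d %*d %*d %*u %*u %*d %*u %*u"
               " %*u %*u %*u %*u %*u %*u %*u %*u %*u %*d %*u %*u %u",
               utime, stime, cpu_id);
    saved_errno = errno;
    fclose(f); /* closed on every path, including the failure path */

    if (n == 3) {
        return 0;
    }
    /* Thread vanished between open and read. */
    return saved_errno == ESRCH ? 1 : -1;
}
```

With something like this, a caller could skip the thread silently on 1
and keep error_report() (or a tracepoint) for -1 only.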

> >  3. when starting daemon not as root, it starts 'fine' but later on complains
> >       qemu-vmsr-helper: Failed to open MSR file at /dev/cpu/0/msr
> >     perhaps it would be better to fail at start daemon if it doesn't have
> >     access to necessary files.
> >  
> 
> Right taking a note on that as well.
> 
> 
> >  4. in case #3, guest also fails to start with errors:
> >       qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock: can't read any virtual msr
> >       qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock: kvm : error RAPL feature requirement not met
> >      again line #2 is not useful and probably not needed (maybe make it tracepoint)
> >      and #1 is unhelpful - it would be better if it directed user to check qemu-vmsr-helper
> >  
> 
> I will try to see how to improve that part. 
> Thanks for your valuable feedback.
> 
> >  5. does AMD have similar MSRs that we could use to make this feature complete?
> >  
> 
> Yes, but the addresses are completely different. However, this is on my ToDo 
> list. First I need way more feedback like yours to move on extending 
> this feature.

If adding AMD's MSRs is not difficult, then I'd make it a priority.
This way users (and libvirt) won't have to deal with 2 different
feature-sets and decide when to allow this to be turned on depending on host.

> 
> >  6. What happens to power accounting if host constantly migrates
> >     vcpus between sockets, are values we are getting still correct/meaningful?
> >     Or do we need to pin vcpus to get 'accurate' values?
> >  
> 
> The ratio calculation takes into account which socket the 
> vCPU has just been scheduled on. But yes, the values are more 'accurate' when 
> the vCPU is pinned.

in worst case VCPUs might be moved between sockets many times during
sample window, can you explain how that is accounted for?
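
For reference, a minimal sketch of the arithmetic in the quoted
vmsr_get_ratio(), which reduces to e_delta * delta_ticks / maxticks.
It assumes, based on the struct comment ("CPU number last executed
on"), that each thread carries a single package id per sample, so all
ticks of a sample window would be attributed to that one package.
struct sample and pkg_energy_share() are hypothetical names, not part
of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same arithmetic as vmsr_get_ratio() in the quoted patch. */
static double vmsr_ratio_sketch(uint64_t e_delta,
                                unsigned long long delta_ticks,
                                unsigned int maxticks)
{
    return (e_delta / 100.0) * ((100.0 / maxticks) * delta_ticks);
}

/* Hypothetical per-thread sample: one package id for the whole window. */
struct sample {
    unsigned int pkg_id;            /* package of the last CPU seen */
    unsigned long long delta_ticks; /* utime+stime delta in the window */
};

/* Sum the energy share of all threads last seen on package 'pkg'. */
static double pkg_energy_share(const struct sample *s, size_t n,
                               unsigned int pkg, uint64_t e_delta,
                               unsigned int maxticks)
{
    double total = 0.0;
    size_t i;

    assert(s != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (s[i].pkg_id == pkg) {
            total += vmsr_ratio_sketch(e_delta, s[i].delta_ticks, maxticks);
        }
    }
    return total;
}
```

With e_delta = 1000, delta_ticks = 50 and maxticks = 100 the share is
500, whichever packages the thread actually visited during the window.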

Anyways, it would be better to have some numbers in doc that would
clarify what kind of accuracy we are talking about (and example
pinned vs unpinned), or whether unpinned case measures average
temperature of patients in hospital and we should recommend
to pin vcpus and everything else.

Also actual usecase examples for the feature should be mentioned
in the doc. So users could figure out when they need to enable
this feature (with attached accuracy numbers). Aka how this
new feature is good for end users and what they can do with it.
 
> >  7. do we have to have a dedicated thread for polling data from the daemon?
> >
> >     Can we fetch data from vcpu thread that have accessed msr
> >     (with some caching and rate limiting access to the daemon)?
> >  
> 
> This feature revolves around a thread. Please look at the 
> documentation if not already done:
> 
> https://www.qemu.org/docs/master/specs/rapl-msr.html#high-level-implementation
> 
> If we only fetch from vCPU thread, we won't have the consumption of the 
> non-vcpu thread. They are taken into account in the total.

one can collect the same data from vcpu thread as well,
the bonus part is that we don't have an extra thread
hanging around and doing work even if guest never asks
for those MSRs.

This also leads to a question, if we should account for
not VCPU threads at all. Looking at real hardware, those
MSRs return power usage of CPUs only, and they do not
return consumption from auxiliary system components
(io/memory/...). One can consider non VCPU threads in QEMU
as auxiliary components, so we probably should not
account for them at all when modeling the same hw feature.
(aka be consistent with what real hw does).

> Thanks again for your feedback. 
> 
> Anthony
> 
> 
> >> In this version (v6), I have attempted to address all the problems 
> >> addressed by Daniel and Paolo during the last review. 
> >> 
> >> However, two open questions remains unanswered that would require the 
> >> attention of a x86 maintainers: 
> >> 
> >> 1)Should I move from -kvm to -cpu the rapl feature ? [2]
> >> 
> >> 2)Should I already rename to "rapl_vmsr_*" in order to anticipate the 
> >>   futur TMPI architecture ? [end of 3] 
> >> 
> >> Thank you again for your continued guidance. 
> >> 
> >> v5 -> v6
> >> --------
> >> - Better error consistency in qio_channel_get_peerpid()
> >> - Memory leak g_strdup_printf/g_build_filename corrected
> >> - Renaming several struct with "vmsr_*" for better namespace
> >> - Renamed several struct with "guest_*" for better comprehension
> >> - Optimization suggerate from Daniel
> >> - Crash problem solved [4]
> >> 
> >> v4 -> v5
> >> --------
> >> 
> >> - correct qio_channel_get_peerpid: return pid = -1 in case of error
> >> - Vmsr_helper: compile only for x86
> >> - Vmsr_helper: use qio_channel_read/write_all
> >> - Vmsr_helper: abandon user/group
> >> - Vmsr_energy.c: correct all error_report
> >> - Vmsr thread: compute default socket path only once
> >> - Vmsr thread: open socket only once
> >> - Pass relevant QEMU CI
> >> 
> >> v3 -> v4
> >> --------
> >> 
> >> - Correct memory leaks with AddressSanitizer  
> >> - Add sanity check for QEMU and qemu-vmsr-helper for checking if host is 
> >>   INTEL and if RAPL is activated.
> >> - Rename poor variables naming for easier comprehension
> >> - Move code that checks Host before creating the VMSR thread
> >> - Get rid of libnuma: create function that read sysfs for reading the 
> >>   Host topology instead
> >> 
> >> v2 -> v3
> >> --------
> >> 
> >> - Move all memory allocations from Clib to Glib
> >> - Compile on *BSD (working on Linux only)
> >> - No more limitation on the virtual package: each vCPU that belongs to 
> >>   the same virtual package is giving the same results like expected on 
> >>   a real CPU.
> >>   This has been tested topology like:
> >>      -smp 4,sockets=2
> >>      -smp 16,sockets=4,cores=2,threads=2
> >> 
> >> v1 -> v2
> >> --------
> >> 
> >> - To overcome the CVE-2020-8694 a socket communication is created
> >>   to a priviliged helper
> >> - Add the priviliged helper (qemu-vmsr-helper)
> >> - Add SO_PEERCRED in qio channel socket
> >> 
> >> RFC -> v1
> >> ---------
> >> 
> >> - Add vmsr_* in front of all vmsr specific function
> >> - Change malloc()/calloc()... with all glib equivalent
> >> - Pre-allocate all dynamic memories when possible
> >> - Add a Documentation of implementation, limitation and usage
> >> 
> >> Best regards,
> >> Anthony
> >> 
> >> [1]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg01570.html
> >> [2]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg03947.html
> >> [3]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg02350.html
> >> [4]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg02481.html
> >> 
> >> Anthony Harivel (3):
> >>   qio: add support for SO_PEERCRED for socket channel
> >>   tools: build qemu-vmsr-helper
> >>   Add support for RAPL MSRs in KVM/Qemu
> >> 
> >>  accel/kvm/kvm-all.c                      |  27 ++
> >>  contrib/systemd/qemu-vmsr-helper.service |  15 +
> >>  contrib/systemd/qemu-vmsr-helper.socket  |   9 +
> >>  docs/specs/index.rst                     |   1 +
> >>  docs/specs/rapl-msr.rst                  | 155 +++++++
> >>  docs/tools/index.rst                     |   1 +
> >>  docs/tools/qemu-vmsr-helper.rst          |  89 ++++
> >>  include/io/channel.h                     |  21 +
> >>  include/sysemu/kvm_int.h                 |  32 ++
> >>  io/channel-socket.c                      |  28 ++
> >>  io/channel.c                             |  13 +
> >>  meson.build                              |   7 +
> >>  target/i386/cpu.h                        |   8 +
> >>  target/i386/kvm/kvm.c                    | 431 +++++++++++++++++-
> >>  target/i386/kvm/meson.build              |   1 +
> >>  target/i386/kvm/vmsr_energy.c            | 337 ++++++++++++++
> >>  target/i386/kvm/vmsr_energy.h            |  99 +++++
> >>  tools/i386/qemu-vmsr-helper.c            | 530 +++++++++++++++++++++++
> >>  tools/i386/rapl-msr-index.h              |  28 ++
> >>  19 files changed, 1831 insertions(+), 1 deletion(-)
> >>  create mode 100644 contrib/systemd/qemu-vmsr-helper.service
> >>  create mode 100644 contrib/systemd/qemu-vmsr-helper.socket
> >>  create mode 100644 docs/specs/rapl-msr.rst
> >>  create mode 100644 docs/tools/qemu-vmsr-helper.rst
> >>  create mode 100644 target/i386/kvm/vmsr_energy.c
> >>  create mode 100644 target/i386/kvm/vmsr_energy.h
> >>  create mode 100644 tools/i386/qemu-vmsr-helper.c
> >>  create mode 100644 tools/i386/rapl-msr-index.h
> >>   
> 




* Re: [PATCH v6 0/3] Add support for the RAPL MSRs series
  2024-10-18 12:25     ` Igor Mammedov
@ 2024-10-18 12:59       ` Daniel P. Berrangé
  2024-10-22 12:46         ` Igor Mammedov
  2024-10-22 13:49       ` Anthony Harivel
  1 sibling, 1 reply; 25+ messages in thread
From: Daniel P. Berrangé @ 2024-10-18 12:59 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Anthony Harivel, pbonzini, mtosatti, qemu-devel, vchundur, rjarry

On Fri, Oct 18, 2024 at 02:25:26PM +0200, Igor Mammedov wrote:
> On Wed, 16 Oct 2024 14:56:39 +0200
> "Anthony Harivel" <aharivel@redhat.com> wrote:
> 
> > Hi Igor,
> > 
> > Igor Mammedov, Oct 16, 2024 at 13:52:
> > > On Wed, 22 May 2024 17:34:49 +0200
> > > Anthony Harivel <aharivel@redhat.com> wrote:
> > >  
> > >> Dear maintainers, 
> > >> 
> > >> First of all, thank you very much for your review of my patch 
> > >> [1].  
> > >
> > > I've tried to play with this feature and have a few questions about it
> > >  
> > 
> > Thanks for testing this new feature. 
> > 
> > >  1. trying to start with non accessible or not existent socket
> > >         -accel kvm,rapl=on,rapl-helper-socket=/tmp/socket 
> > >     I get:
> > >       qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/socks: vmsr socket opening failed
> > >       qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/socks: kvm : error RAPL feature requirement not met
> > >     * is it possible to report actual OS error that happened during open/connect,
> > >       instead of unhelpful 'socket opening failed'?
> > >
> > >       What I see in vmsr_open_socket() error is ignored
> > >       and btw it's error leak as well
> > >  
> > 
> > Shame you missed the 6 iterations of that patch that lasted for a year. 
> > I would have changed that directly !
> > Anyway I take note on that comment and will send a modification.
> > 
> > >     * 2nd line shouldn't be there if the 1st error already present.
> > >
> > >  2.  getting periodic error on console where QEMU has been started
> > >       # ./qemu-vmsr-helper -k /tmp/sock
> > >      ./qemu-system-x86_64 -snapshot -m 4G -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock rhel90.img  -vnc :0 -cpu host
> > >      and let it run
> > >
> > >       it appears rdmsr works (well, it returns some values at least)
> > >       however there are recurring errors in qemu's stderr(or out)
> > >       
> > >       qemu-system-x86_64: Error opening /proc/2496093/task/2496109/stat
> > >       qemu-system-x86_64: Error opening /proc/2496093/task/2496095/stat
> > >
> > >       My guess it's some temporary threads, that come and go, but still
> > >       they shouldn't cause errors if it's normal operation.
> > >  
> > 
> > There is a WIP patch that changes this into a tracepoint. Maybe you can 
> > SSH to the VM in the meantime?
> 
> it's just idling VM that doesn't do anything, hence the question.  
> 
> > 
> > >       Also on daemon side, I a few times got while guest was running:
> > >         qemu-vmsr-helper: Failed to open /proc at /proc/2496026/task/2496044
> > >         qemu-vmsr-helper: Requested TID not in peer PID: 2496026 2496044
> > >       though I can't reproduce it reliably  
> > 
> > This could happen only when a vCPU thread ID has changed between the 
> > call of a rdmsr through the socket and the helper that reads the MSR.
> > No idea how a vCPU can change TID or shutdown that fast.
> 
> I guess it needs to be figured out to decide if it's safe to ignore (and not print error)
> or if it's a genuine error/bug somewhere
> 
> > >  3. when starting daemon not as root, it starts 'fine' but later on complains
> > >       qemu-vmsr-helper: Failed to open MSR file at /dev/cpu/0/msr
> > >     perhaps it would be better to fail at start daemon if it doesn't have
> > >     access to necessary files.
> > >  
> > 
> > Right taking a note on that as well.
> > 
> > 
> > >  4. in case #3, guest also fails to start with errors:
> > >       qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock: can't read any virtual msr
> > >       qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock: kvm : error RAPL feature requirement not met
> > >      again line #2 is not useful and probably not needed (maybe make it tracepoint)
> > >      and #1 is unhelpful - it would be better if it directed user to check qemu-vmsr-helper
> > >  
> > 
> > I will try to see how to improve that part. 
> > Thanks for your valuable feedback.
> > 
> > >  5. does AMD have similar MSRs that we could use to make this feature complete?
> > >  
> > 
> > Yes, but the addresses are completely different. However, this is on my ToDo 
> > list. First I need way more feedback like yours to move on extending 
> > this feature.
> 
> If adding AMD's MSRs is not difficult, then I'd make it a priority.
> This way users (and libvirt) won't have to deal with 2 different
> feature-sets and decide when to allow this to be turned on depending on host.
> 
> > 
> > >  6. What happens to power accounting if host constantly migrates
> > >     vcpus between sockets, are values we are getting still correct/meaningful?
> > >     Or do we need to pin vcpus to get 'accurate' values?
> > >  
> > 
> > The ratio calculation takes into account which socket the 
> > vCPU has just been scheduled on. But yes, the values are more 'accurate' when 
> > the vCPU is pinned.
> 
> in worst case VCPUs might be moved between sockets many times during
> sample window, can you explain how that is accounted for?
> 
> Anyways, it would be better to have some numbers in doc that would
> clarify what kind of accuracy we are talking about (and example
> pinned vs unpinned), or whether unpinned case measures average
> temperature of patients in hospital and we should recommend
> to pin vcpus and everything else.
> 
> Also actual usecase examples for the feature should be mentioned
> in the doc. So users could figure out when they need to enable
> this feature (with attached accuracy numbers). Aka how this
> new feature is good for end users and what they can do with it.
>  
> > >  7. do we have to have a dedicated thread for polling data from the daemon?
> > >
> > >     Can we fetch data from vcpu thread that have accessed msr
> > >     (with some caching and rate limiting access to the daemon)?
> > >  
> > 
> > This feature revolves around a thread. Please look at the 
> > documentation if not already done:
> > 
> > https://www.qemu.org/docs/master/specs/rapl-msr.html#high-level-implementation
> > 
> > If we only fetch from vCPU thread, we won't have the consumption of the 
> > non-vcpu thread. They are taken into account in the total.
> 
> one can collect the same data from vcpu thread as well,
> the bonus part is that we don't have an extra thread
> hanging around and doing work even if guest never asks
> for those MSRs.
> 
> This also leads to a question, if we should account for
> not VCPU threads at all. Looking at real hardware, those
> MSRs return power usage of CPUs only, and they do not
> return consumption from auxiliary system components
> (io/memory/...). One can consider non VCPU threads in QEMU
> as auxiliary components, so we probably should not
> account for them at all when modeling the same hw feature.
> (aka be consistent with what real hw does).

I understand your POV, but I think that would be a mistake,
and would undermine the usefulness of the feature.

The deployment model has a cluster of hosts and guests, all
belonging to the same user. The user goal is to measure host
power consumption imposed by the guest, and dynamically adjust
guest workloads in order to minimize power consumption of the
host.

The guest workloads can impose non-negligible power consumption
loads on non-vCPU threads in QEMU. Without that accounted for,
any adjustments will be working from (sometimes very) inaccurate
data.

IOW, I think it is right to include non-vCPU threads usage in
the reported info, as it is still fundamentally part of the
load that the guest imposes on host pCPUs it is permitted to
run on.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v6 0/3] Add support for the RAPL MSRs series
  2024-10-18 12:59       ` Daniel P. Berrangé
@ 2024-10-22 12:46         ` Igor Mammedov
  2024-10-22 13:15           ` Daniel P. Berrangé
  0 siblings, 1 reply; 25+ messages in thread
From: Igor Mammedov @ 2024-10-22 12:46 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Anthony Harivel, pbonzini, mtosatti, qemu-devel, vchundur, rjarry

On Fri, 18 Oct 2024 13:59:34 +0100
Daniel P. Berrangé <berrange@redhat.com> wrote:

> On Fri, Oct 18, 2024 at 02:25:26PM +0200, Igor Mammedov wrote:
> > On Wed, 16 Oct 2024 14:56:39 +0200
> > "Anthony Harivel" <aharivel@redhat.com> wrote:
[...]

> > 
> > This also leads to a question, if we should account for
> > not VCPU threads at all. Looking at real hardware, those
> > MSRs return power usage of CPUs only, and they do not
> > return consumption from auxiliary system components
> > (io/memory/...). One can consider non VCPU threads in QEMU
> > as auxiliary components, so we probably should not
> > account for them at all when modeling the same hw feature.
> > (aka be consistent with what real hw does).  
> 
> I understand your POV, but I think that would be a mistake,
> and would undermine the usefulness of the feature.
> 
> The deployment model has a cluster of hosts and guests, all
> belonging to the same user. The user goal is to measure host
> power consumption imposed by the guest, and dynamically adjust
> guest workloads in order to minimize power consumption of the
> host.

For cloud use-case, host side is likely in a better position
to accomplish the task of saving power by migrating VM to
another socket/host to compact idle load. (I've found at least 1
Kubernetes tool[1], which does energy monitoring). Perhaps there
are schedulers out there that do that using its data.

> The guest workloads can impose non-negligible power consumption
> loads on non-vCPU threads in QEMU. Without that accounted for,
> any adjustments will be working from (sometimes very) inaccurate
> data.

Perhaps adding one or several energy sensors (ex: some i2c ones),
would let us provide auxiliary threads consumption to guest, and
even make it more granular if necessary (incl. vhost user/out of
process device models or pass-through devices if they have PMU).
It would be better than further muddling vCPUs consumption
estimates with something that doesn't belong there.

> IOW, I think it is right to include non-vCPU threads usage in
> the reported info, as it is still fundamentally part of the
> load that the guest imposes on host pCPUs it is permitted to
> run on.

From what I've read, process energy usage done via RAPL is not
exactly accurate. But there are monitoring tools out there that
use RAPL and other sources to make energy consumption monitoring
more reliable.

Reinventing that wheel and pulling all of the nuances of process
power monitoring inside of QEMU process, needlessly complicates it.
Maybe we should reuse one of existing tools and channel its data
through appropriate QEMU channels (RAPL/emulated PMU counters/...).

Implementing RAPL in pure form though looks fine to me,
so the same tools could use it the same way as on the host
if needed without VM specific quirks.

1) https://github.com/sustainable-computing-io/kepler

> With regards,
> Daniel


* Re: [PATCH v6 0/3] Add support for the RAPL MSRs series
  2024-10-22 12:46         ` Igor Mammedov
@ 2024-10-22 13:15           ` Daniel P. Berrangé
  2024-10-22 14:16             ` Anthony Harivel
  2024-10-22 15:35             ` Igor Mammedov
  0 siblings, 2 replies; 25+ messages in thread
From: Daniel P. Berrangé @ 2024-10-22 13:15 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Anthony Harivel, pbonzini, mtosatti, qemu-devel, vchundur, rjarry

On Tue, Oct 22, 2024 at 02:46:15PM +0200, Igor Mammedov wrote:
> On Fri, 18 Oct 2024 13:59:34 +0100
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> 
> > On Fri, Oct 18, 2024 at 02:25:26PM +0200, Igor Mammedov wrote:
> > > On Wed, 16 Oct 2024 14:56:39 +0200
> > > "Anthony Harivel" <aharivel@redhat.com> wrote:
> [...]
> 
> > > 
> > > This also leads to a question, if we should account for
> > > not VCPU threads at all. Looking at real hardware, those
> > > MSRs return power usage of CPUs only, and they do not
> > > return consumption from auxiliary system components
> > > (io/memory/...). One can consider non VCPU threads in QEMU
> > > as auxiliary components, so we probably should not
> > > account for them at all when modeling the same hw feature.
> > > (aka be consistent with what real hw does).  
> > 
> > I understand your POV, but I think that would be a mistake,
> > and would undermine the usefulness of the feature.
> > 
> > The deployment model has a cluster of hosts and guests, all
> > belonging to the same user. The user goal is to measure host
> > power consumption imposed by the guest, and dynamically adjust
> > guest workloads in order to minimize power consumption of the
> > host.
> 
> For cloud use-case, host side is likely in a better position
> to accomplish the task of saving power by migrating VM to
> another socket/host to compact idle load. (I've found at least 1
> Kubernetes tool[1], which does energy monitoring). Perhaps there
> are schedulers out there that do that using its data.

The host admin can merely shuffle workloads around, hoping that
a different packing of workloads onto machines will reduce power
by some amount. You might win a few %, or low 10s of % with this
if you're good at it.

The guest admin can change the way their workload operates to
reduce its inherent power consumption baseline. You could easily
come across ways to win high 10s of % with this. That's why it
is interesting to expose power consumption info to the guest
admin.

IOW, neither makes the other obsolete, both approaches are
desirable.

> > The guest workloads can impose non-negligble power consumption
> > loads on non-vCPU threads in QEMU. Without that accounted for,
> > any adjustments will be working from (sometimes very) inaccurate
> > data.
> 
> Perhaps adding one or several energy sensors (ex: some i2c ones),
> would let us provide auxiliary threads consumption to guest, and
> even make it more granular if necessary (incl. vhost user/out of
> process device models or pass-through devices if they have PMU).
> It would be better than further muddling vCPUs consumption
> estimates with something that doesn't belong there.

There's a tradeoff here in that info directly associated with
backend threads is effectively exposing private QEMU impl
details as public ABI. IOW, we don't want too fine granularity
here, we need it abstracted sufficiently, that different
backend choices for a given device don't change what sensors are
exposed.

I also wonder how existing power monitoring applications
would consume such custom sensors - is there sufficient
standardization in this area that we're not inventing
something totally QEMU specific ?

> > IOW, I think it is right to include non-vCPU threads usage in
> > the reported info, as it is still fundamentally part of the
> > load that the guest imposes on host pCPUs it is permitted to
> > run on.
> 
> 
> From what I've read, process energy usage done via RAPL is not
> exactly accurate. But there are monitoring tools out there that
> use RAPL and other sources to make energy consumption monitoring
> more reliable.
> 
> Reinventing that wheel and pulling all of the nuances of process
> power monitoring inside of QEMU process, needlessly complicates it.
> Maybe we should reuse one of existing tools and channel its data
> through appropriate QEMU channels (RAPL/emulated PMU counters/...).

Note, this feature is already released in QEMU 9.1.0.

> Implementing RAPL in pure form though looks fine to me,
> so the same tools could use it the same way as on the host
> if needed without VM specific quirks.

IMHO the so-called "pure" form is misleading to applications, unless
we first provide some other practical way to expose the data that
we would be throwing away from RAPL.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6 0/3] Add support for the RAPL MSRs series
  2024-10-18 12:25     ` Igor Mammedov
  2024-10-18 12:59       ` Daniel P. Berrangé
@ 2024-10-22 13:49       ` Anthony Harivel
  2024-11-04  9:40         ` Igor Mammedov
  1 sibling, 1 reply; 25+ messages in thread
From: Anthony Harivel @ 2024-10-22 13:49 UTC (permalink / raw)
  To: Igor Mammedov; +Cc: pbonzini, mtosatti, berrange, qemu-devel, vchundur, rjarry

Igor Mammedov, Oct 18, 2024 at 14:25:
> On Wed, 16 Oct 2024 14:56:39 +0200
> "Anthony Harivel" <aharivel@redhat.com> wrote:
>
>> Hi Igor,
>> 
>> Igor Mammedov, Oct 16, 2024 at 13:52:
>> > On Wed, 22 May 2024 17:34:49 +0200
>> > Anthony Harivel <aharivel@redhat.com> wrote:
>> >  
>> >> Dear maintainers, 
>> >> 
>> >> First of all, thank you very much for your review of my patch 
>> >> [1].  
>> >
>> > I've tried to play with this feature and have a few questions about it
>> >  
>> 
>> Thanks for testing this new feature. 
>> 
>> >  1. trying to start with non accessible or not existent socket
>> >         -accel kvm,rapl=on,rapl-helper-socket=/tmp/socket 
>> >     I get:
>> >       qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/socks: vmsr socket opening failed
>> >       qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/socks: kvm : error RAPL feature requirement not met
>> >     * is it possible to report actual OS error that happened during open/connect,
>> >       instead of unhelpful 'socket opening failed'?
>> >
>> >       What I see in vmsr_open_socket() error is ignored
>> >       and btw it's error leak as well
>> >  
>> 
>> Shame you missed the 6 iterations of that patch that last for a year. 
>> I would have changed that directly !
>> Anyway I take note on that comment and will send a modification.
>> 
>> >     * 2nd line shouldn't be there if the 1st error already present.
>> >
>> >  2.  getting periodic error on console where QEMU has been starter
>> >       # ./qemu-vmsr-helper -k /tmp/sock
>> >      ./qemu-system-x86_64 -snapshot -m 4G -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock rhel90.img  -vnc :0 -cpu host
>> >      and let it run
>> >
>> >       it appears rdmsr works (well, it returns some values at least)
>> >       however there are recurring errors in qemu's stderr(or out)
>> >       
>> >       qemu-system-x86_64: Error opening /proc/2496093/task/2496109/stat
>> >       qemu-system-x86_64: Error opening /proc/2496093/task/2496095/stat
>> >
>> >       My guess it's some temporary threads, that come and go, but still
>> >       they shouldn't cause errors if it's normal operation.
>> >  
>> 
>> There a patch in WIP that change this into a Tracepoint. Maybe you can 
>> SSH to the VM in meanwhile ?
>
> it's just idling VM that doesn't do anything, hence the question.  
>
>> 
>> >       Also on daemon side, I a few times got while guest was running:
>> >         qemu-vmsr-helper: Failed to open /proc at /proc/2496026/task/2496044
>> >         qemu-vmsr-helper: Requested TID not in peer PID: 2496026 2496044
>> >       though I can't reproduce it reliably  
>> 
>> This could happen only when a vCPU thread ID has changed between the 
>> call of a rdmsr throught the socket and the hepler that read the msr.
>> No idea how a vCPU can change TID or shutdown that fast.
>
> I guess it needs to be figured out to decide if it's safe to ignore (and not print error)
> or if it's a genuine error/bug somewhere
>
>> >  3. when starting daemon not as root, it starts 'fine' but later on complains
>> >       qemu-vmsr-helper: Failed to open MSR file at /dev/cpu/0/msr
>> >     perhaps it would be better to fail at start daemon if it doesn't have
>> >     access to necessary files.
>> >  
>> 
>> Right taking a note on that as well.
>> 
>> 
>> >  4. in case #3, guest also fails to start with errors:
>> >       qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock: can't read any virtual msr
>> >       qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock: kvm : error RAPL feature requirement not met
>> >      again line #2 is not useful and probably not needed (maybe make it tracepoint)
>> >      and #1 is unhelpful - it would be better if it directed user to check qemu-vmsr-helper
>> >  
>> 
>> I will try to see how to improve that part. 
>> Thanks for your valuable feedback.
>> 
>> >  5. does AMD have similar MSRs that we could use to make this feature complete?
>> >  
>> 
>> Yes but the address are completely different. However, this in my ToDo 
>> list. First I need way more feedback like yours to move on extending 
>> this feature.
>
> If adding AMD's MSRs is not difficult, then I'd make it priority.
> This way users (and libvirt) won't have to deal with 2 different
> feature-sets and decide when to allow this to be turned on depending on host.
>

QEMU needs to know whether it runs on an Intel or an AMD machine in order
to choose which set of MSRs it must read. I have not checked how to achieve
this yet, but I will when I work on that.
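
For illustration only, one way the vendor check and MSR selection could
look. This is a minimal sketch, not code from the QEMU tree: every
identifier below (RaplVendor, RaplMsrSet, rapl_vendor_from_string,
rapl_msr_set_for, rapl_host_vendor) is hypothetical. The MSR addresses are
the ones the Linux kernel lists in msr-index.h, and the CPUID part assumes
GCC or Clang on an x86 host.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical names: none of these identifiers exist in QEMU. */
typedef enum {
    RAPL_VENDOR_UNKNOWN = 0,
    RAPL_VENDOR_INTEL,
    RAPL_VENDOR_AMD,
} RaplVendor;

typedef struct {
    uint32_t power_unit;        /* register describing the energy unit */
    uint32_t pkg_energy_status; /* package energy counter */
} RaplMsrSet;

/* Map the 12-character CPUID leaf 0 vendor string to a vendor id. */
RaplVendor rapl_vendor_from_string(const char *vendor)
{
    if (strcmp(vendor, "GenuineIntel") == 0) {
        return RAPL_VENDOR_INTEL;
    }
    if (strcmp(vendor, "AuthenticAMD") == 0) {
        return RAPL_VENDOR_AMD;
    }
    return RAPL_VENDOR_UNKNOWN;
}

/* Return the MSR addresses for a vendor, or NULL if unsupported. */
const RaplMsrSet *rapl_msr_set_for(RaplVendor vendor)
{
    /* Addresses as listed in the Linux kernel's msr-index.h. */
    static const RaplMsrSet intel = { 0x00000606, 0x00000611 };
    static const RaplMsrSet amd   = { 0xc0010299, 0xc001029b };

    switch (vendor) {
    case RAPL_VENDOR_INTEL:
        return &intel;
    case RAPL_VENDOR_AMD:
        return &amd;
    default:
        return NULL;
    }
}

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/* Read the host vendor string with CPUID leaf 0 (GCC/Clang helper). */
RaplVendor rapl_host_vendor(void)
{
    unsigned int eax, ebx, ecx, edx;
    char vendor[13];

    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx)) {
        return RAPL_VENDOR_UNKNOWN;
    }
    /* The vendor string is stored in EBX, EDX, ECX, in that order. */
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';
    return rapl_vendor_from_string(vendor);
}
#endif
```

With something along these lines, the helper and QEMU could share one code
path and only the table of addresses would differ per vendor. Whether the
AMD counters have the same semantics as the Intel ones is not covered here.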

>> 
>> >  6. What happens to power accounting if host constantly migrates
>> >     vcpus between sockets, are values we are getting still correct/meaningful?
>> >     Or do we need to pin vcpus to get 'accurate' values?
>> >  
>> 
>> It's taken into account during the ratio calculation which socket the 
>> vCPU has just been scheduled. But yes the value are more 'accurate' when 
>> the vCPU is pinned.
>
> in worst case VCPUs might be moved between sockets many times during
> sample window, can you explain how that is accounted for?
>

If one vCPU is moving socket during the sample period then it is 
detected and not taken into account.
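
To make the idea concrete, here is a rough sketch of how such a check can
be built on /proc/<pid>/task/<tid>/stat, whose field 39 is the CPU the
thread last ran on (see proc(5)). This is not the code from the series:
ThreadStat, thread_stat_parse and thread_changed_package are hypothetical
names, and the cpu_to_pkg table is assumed to have been filled from the
host topology beforehand. Note that comparing only the two ends of a
window, as done here, cannot see a thread that left and came back in
between.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical: the fields of interest from /proc/<pid>/task/<tid>/stat. */
typedef struct {
    unsigned long utime;  /* field 14: user-mode ticks */
    unsigned long stime;  /* field 15: kernel-mode ticks */
    int processor;        /* field 39: CPU the thread last ran on */
} ThreadStat;

/* Parse one stat line. Returns 0 on success, -1 on malformed input. */
int thread_stat_parse(const char *line, ThreadStat *out)
{
    /* comm (field 2) may contain spaces or ')': resume after the last ')' */
    const char *p = strrchr(line, ')');
    int field = 2;
    int found = 0;

    if (p == NULL) {
        return -1;
    }
    p++;

    while (*p != '\0' && found < 3) {
        while (*p == ' ') {
            p++;
        }
        if (*p == '\0') {
            break;
        }
        field++;
        switch (field) {
        case 14:
            found += (sscanf(p, "%lu", &out->utime) == 1);
            break;
        case 15:
            found += (sscanf(p, "%lu", &out->stime) == 1);
            break;
        case 39:
            found += (sscanf(p, "%d", &out->processor) == 1);
            break;
        default:
            break;
        }
        /* Skip the rest of the current token. */
        while (*p != ' ' && *p != '\0') {
            p++;
        }
    }
    return found == 3 ? 0 : -1;
}

/*
 * Compare the package of a thread at the start and end of a sample window.
 * Returns 1 if it changed, 0 if not, -1 if a CPU index is out of range.
 */
int thread_changed_package(const ThreadStat *before, const ThreadStat *after,
                           const int *cpu_to_pkg, int ncpus)
{
    if (before->processor < 0 || before->processor >= ncpus ||
        after->processor < 0 || after->processor >= ncpus) {
        return -1;
    }
    return cpu_to_pkg[before->processor] != cpu_to_pkg[after->processor];
}
```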

That said, if your system is bouncing vCPUs back and forth between sockets,
then you will experience a lot of cache misses, cache thrashing, context
switches, increased memory latency (NUMA issues), etc. This will lead to
performance degradation and very poor VM performance. In that case you
should probably fix that first.

> Anyways, it would be better to have some numbers in doc that would
> clarify what kind of accuracy we are talking about (and example
> pinned vs unpinned), or whether unpinned case measures average
> temperature of patients in hospital and we should recommend
> to pin vcpus and everything else.
>

I totally understand that I can add more clarification to the
documentation on points that might be obvious for some but not for
others, such as the fact that isolating your VM properly will give
better results.

But I won't give any numbers. It doesn't make sense. Accuracy is not the
goal of this feature; it never was and it never will be. First of all,
because RAPL is not accurate for power monitoring. You want accuracy?
Use a power metering device.
You want a reproducible way to compare energy consumption between
A and B in order to optimize your software? You can use RAPL, and
therefore this feature, which shows good reproducible results.
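
As a small worked example of such an A/B measurement: the energy status
MSRs are 32-bit counters, and the unit of one count is 1/2^ESU joules,
where ESU is bits 12:8 of MSR_RAPL_POWER_UNIT. The sketch below converts
two counter reads into joules. The function names are hypothetical and it
assumes the counter wrapped at most once between the two reads.

```c
#include <stdint.h>

/*
 * Energy Status Unit: bits 12:8 of MSR_RAPL_POWER_UNIT.
 * One counter increment is 1 / 2^ESU joules.
 */
double rapl_joules_per_count(uint64_t power_unit_msr)
{
    unsigned int esu = (power_unit_msr >> 8) & 0x1f;

    return 1.0 / (double)(1ULL << esu);
}

/*
 * Energy consumed between two reads of a 32-bit energy status counter.
 * The unsigned 32-bit subtraction gives the right result across a single
 * counter wrap-around.
 */
double rapl_delta_joules(uint32_t before, uint32_t after,
                         uint64_t power_unit_msr)
{
    uint32_t counts = after - before;

    return (double)counts * rapl_joules_per_count(power_unit_msr);
}
```

Running workload A and then workload B over the same duration in the same
environment and comparing the two deltas gives the kind of relative
comparison described above.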

> Also actual usecase examples for the feature should be mentioned
> in the doc. So users could figure out when they need to enable
> this feature (with attached accuracy numbers). Aka how this
> new feature is good for end users and what they can do with it.
>

Got it. More documentation, use cases, examples.
I will see what can be added to QEMU documentation.


>> >  7. do we have to have a dedicated thread for pooling data from daemon?
>> >
>> >     Can we fetch data from vcpu thread that have accessed msr
>> >     (with some caching and rate limiting access to the daemon)?
>> >  
>> 
>> This feature is revolving around a thread. Please look at the 
>> documentation is not already done:
>> 
>> https://www.qemu.org/docs/master/specs/rapl-msr.html#high-level-implementation
>> 
>> If we only fetch from vCPU thread, we won't have the consumption of the 
>> non-vcpu thread. They are taken into account in the total.
>
> one can collect the same data from vcpu thread as well,
> the bonus part is that we don't have an extra thread
> hanging around and doing work even if guest never asks
> for those MSRs.
>
> This also leads to a question, if we should account for
> not VCPU threads at all. Looking at real hardware, those
> MSRs return power usage of CPUs only, and they do not
> return consumption from auxiliary system components
> (io/memory/...). One can consider non VCPU threads in QEMU
> as auxiliary components, so we probably should not to
> account for them at all when modeling the same hw feature.
> (aka be consistent with what real hw does).
>
>> Thanks again for your feedback. 
>> 
>> Anthony
>> 
>> 
>> >> In this version (v6), I have attempted to address all the problems 
>> >> addressed by Daniel and Paolo during the last review. 
>> >> 
>> >> However, two open questions remains unanswered that would require the 
>> >> attention of a x86 maintainers: 
>> >> 
>> >> 1)Should I move from -kvm to -cpu the rapl feature ? [2]
>> >> 
>> >> 2)Should I already rename to "rapl_vmsr_*" in order to anticipate the 
>> >>   futur TMPI architecture ? [end of 3] 
>> >> 
>> >> Thank you again for your continued guidance. 
>> >> 
>> >> v5 -> v6
>> >> --------
>> >> - Better error consistency in qio_channel_get_peerpid()
>> >> - Memory leak g_strdup_printf/g_build_filename corrected
>> >> - Renaming several struct with "vmsr_*" for better namespace
>> >> - Renamed several struct with "guest_*" for better comprehension
>> >> - Optimization suggerate from Daniel
>> >> - Crash problem solved [4]
>> >> 
>> >> v4 -> v5
>> >> --------
>> >> 
>> >> - correct qio_channel_get_peerpid: return pid = -1 in case of error
>> >> - Vmsr_helper: compile only for x86
>> >> - Vmsr_helper: use qio_channel_read/write_all
>> >> - Vmsr_helper: abandon user/group
>> >> - Vmsr_energy.c: correct all error_report
>> >> - Vmsr thread: compute default socket path only once
>> >> - Vmsr thread: open socket only once
>> >> - Pass relevant QEMU CI
>> >> 
>> >> v3 -> v4
>> >> --------
>> >> 
>> >> - Correct memory leaks with AddressSanitizer  
>> >> - Add sanity check for QEMU and qemu-vmsr-helper for checking if host is 
>> >>   INTEL and if RAPL is activated.
>> >> - Rename poor variables naming for easier comprehension
>> >> - Move code that checks Host before creating the VMSR thread
>> >> - Get rid of libnuma: create function that read sysfs for reading the 
>> >>   Host topology instead
>> >> 
>> >> v2 -> v3
>> >> --------
>> >> 
>> >> - Move all memory allocations from Clib to Glib
>> >> - Compile on *BSD (working on Linux only)
>> >> - No more limitation on the virtual package: each vCPU that belongs to 
>> >>   the same virtual package is giving the same results like expected on 
>> >>   a real CPU.
>> >>   This has been tested topology like:
>> >>      -smp 4,sockets=2
>> >>      -smp 16,sockets=4,cores=2,threads=2
>> >> 
>> >> v1 -> v2
>> >> --------
>> >> 
>> >> - To overcome the CVE-2020-8694 a socket communication is created
>> >>   to a priviliged helper
>> >> - Add the priviliged helper (qemu-vmsr-helper)
>> >> - Add SO_PEERCRED in qio channel socket
>> >> 
>> >> RFC -> v1
>> >> ---------
>> >> 
>> >> - Add vmsr_* in front of all vmsr specific function
>> >> - Change malloc()/calloc()... with all glib equivalent
>> >> - Pre-allocate all dynamic memories when possible
>> >> - Add a Documentation of implementation, limitation and usage
>> >> 
>> >> Best regards,
>> >> Anthony
>> >> 
>> >> [1]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg01570.html
>> >> [2]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg03947.html
>> >> [3]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg02350.html
>> >> [4]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg02481.html
>> >> 
>> >> Anthony Harivel (3):
>> >>   qio: add support for SO_PEERCRED for socket channel
>> >>   tools: build qemu-vmsr-helper
>> >>   Add support for RAPL MSRs in KVM/Qemu
>> >> 
>> >>  accel/kvm/kvm-all.c                      |  27 ++
>> >>  contrib/systemd/qemu-vmsr-helper.service |  15 +
>> >>  contrib/systemd/qemu-vmsr-helper.socket  |   9 +
>> >>  docs/specs/index.rst                     |   1 +
>> >>  docs/specs/rapl-msr.rst                  | 155 +++++++
>> >>  docs/tools/index.rst                     |   1 +
>> >>  docs/tools/qemu-vmsr-helper.rst          |  89 ++++
>> >>  include/io/channel.h                     |  21 +
>> >>  include/sysemu/kvm_int.h                 |  32 ++
>> >>  io/channel-socket.c                      |  28 ++
>> >>  io/channel.c                             |  13 +
>> >>  meson.build                              |   7 +
>> >>  target/i386/cpu.h                        |   8 +
>> >>  target/i386/kvm/kvm.c                    | 431 +++++++++++++++++-
>> >>  target/i386/kvm/meson.build              |   1 +
>> >>  target/i386/kvm/vmsr_energy.c            | 337 ++++++++++++++
>> >>  target/i386/kvm/vmsr_energy.h            |  99 +++++
>> >>  tools/i386/qemu-vmsr-helper.c            | 530 +++++++++++++++++++++++
>> >>  tools/i386/rapl-msr-index.h              |  28 ++
>> >>  19 files changed, 1831 insertions(+), 1 deletion(-)
>> >>  create mode 100644 contrib/systemd/qemu-vmsr-helper.service
>> >>  create mode 100644 contrib/systemd/qemu-vmsr-helper.socket
>> >>  create mode 100644 docs/specs/rapl-msr.rst
>> >>  create mode 100644 docs/tools/qemu-vmsr-helper.rst
>> >>  create mode 100644 target/i386/kvm/vmsr_energy.c
>> >>  create mode 100644 target/i386/kvm/vmsr_energy.h
>> >>  create mode 100644 tools/i386/qemu-vmsr-helper.c
>> >>  create mode 100644 tools/i386/rapl-msr-index.h
>> >>   
>> 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6 0/3] Add support for the RAPL MSRs series
  2024-10-22 13:15           ` Daniel P. Berrangé
@ 2024-10-22 14:16             ` Anthony Harivel
  2024-10-22 14:29               ` Daniel P. Berrangé
  2024-11-01 15:09               ` Igor Mammedov
  2024-10-22 15:35             ` Igor Mammedov
  1 sibling, 2 replies; 25+ messages in thread
From: Anthony Harivel @ 2024-10-22 14:16 UTC (permalink / raw)
  To: Daniel P. Berrangé, Igor Mammedov
  Cc: pbonzini, mtosatti, qemu-devel, vchundur, rjarry

Daniel P. Berrangé, Oct 22, 2024 at 15:15:
> On Tue, Oct 22, 2024 at 02:46:15PM +0200, Igor Mammedov wrote:
>> On Fri, 18 Oct 2024 13:59:34 +0100
>> Daniel P. Berrangé <berrange@redhat.com> wrote:
>> 
>> > On Fri, Oct 18, 2024 at 02:25:26PM +0200, Igor Mammedov wrote:
>> > > On Wed, 16 Oct 2024 14:56:39 +0200
>> > > "Anthony Harivel" <aharivel@redhat.com> wrote:
>> [...]
>> 
>> > > 
>> > > This also leads to a question, if we should account for
>> > > not VCPU threads at all. Looking at real hardware, those
>> > > MSRs return power usage of CPUs only, and they do not
>> > > return consumption from auxiliary system components
>> > > (io/memory/...). One can consider non VCPU threads in QEMU
>> > > as auxiliary components, so we probably should not to
>> > > account for them at all when modeling the same hw feature.
>> > > (aka be consistent with what real hw does).  
>> > 
>> > I understand your POV, but I think that would be a mistake,
>> > and would undermine the usefulness of the feature.
>> > 
>> > The deployment model has a cluster of hosts and guests, all
>> > belonging to the same user. The user goal is to measure host
>> > power consumption imposed by the guest, and dynamically adjust
>> > guest workloads in order to minimize power consumption of the
>> > host.
>> 
>> For cloud use-case, host side is likely in a better position
>> to accomplish the task of saving power by migrating VM to
>> another socket/host to compact idle load. (I've found at least 1
>> kubernetis tool[1], which does energy monitoring). Perhaps there
>> are schedulers out there that do that using its data.

I also work for the Kepler project. I use it to monitor my VM as a black
box, and I have used it inside my VM with this feature enabled. Thanks to
that I can optimize the workloads (DPDK applications, databases, ...)
inside my VM.

This is the use case in NFV deployments, and I'm pretty sure it could be
the use case of many others.

>
> The host admin can merely shuffle workloads around, hoping that
> a different packing of workloads onto machines, will reduce power
> in some aount. You might win a few %, or low 10s of % with this
> if you're good at it.
>
> The guest admin can change the way their workload operates to
> reduce its inherant power consumption baseline. You could easily
> come across ways to win high 10s of % with this. That's why it
> is interesting to expose power consumption info to the guest
> admin.
>
> IOW, neither makes the other obsolete, both approaches are
> desirable.
>
>> > The guest workloads can impose non-negligble power consumption
>> > loads on non-vCPU threads in QEMU. Without that accounted for,
>> > any adjustments will be working from (sometimes very) inaccurate
>> > data.
>> 
>> Perhaps adding one or several energy sensors (ex: some i2c ones),
>> would let us provide auxiliary threads consumption to guest, and
>> even make it more granular if necessary (incl. vhost user/out of
>> process device models or pass-through devices if they have PMU).
>> It would be better than further muddling vCPUs consumption
>> estimates with something that doesn't belong there.

I'm confused by your statement. Every software power metering tool out
there uses RAPL (Kepler, Scaphandre, PowerMon, etc.), and custom sensors
would be better than what everyone is using?
The goal is not to be accurate. The goal is to be able to compare
A against B in the same environment, and RAPL gives reproducible
values to do so.
Adding RAPL inside the VM makes total sense because you can use tools that
are already out on the market.

>
> There's a tradeoff here in that info directly associated with
> backends threads, is effectively exposing private QEMU impl
> details as public ABI. IOW, we don't want too fine granularity
> here, we need it abstracted sufficiently, that different
> backend choices for a given don't change what sensors are
> exposed.
>
> I also wonder how existing power monitoring applications
> would consume such custom sensors - is there sufficient
> standardization in this are that we're not inventing
> something totally QEMU specific ?
>
>> > IOW, I think it is right to include non-vCPU threads usage in
>> > the reported info, as it is still fundamentally part of the
>> > load that the guest imposes on host pCPUs it is permitted to
>> > run on.
>> 
>> 
>> From what I've read, process energy usage done via RAPL is not
>> exactly accurate. But there are monitoring tools out there that
>> use RAPL and other sources to make energy consumption monitoring
>> more reliable.
>> 
>> Reinventing that wheel and pulling all of the nuances of process
>> power monitoring inside of QEMU process, needlessly complicates it.
>> Maybe we should reuse one of existing tools and channel its data
>> through appropriate QEMU channels (RAPL/emulated PMU counters/...).
>
> Note, this feature is already released in QEMU 9.1.0.
>
>> Implementing RAPL in pure form though looks fine to me,
>> so the same tools could use it the same way as on the host
>> if needed without VM specific quirks.
>
> IMHO the so called "pure" form is misleading to applications, unless
> we first provided  some other pratical way to expose the data that
> we would be throwing away from RAPL.
>

The other possibility that I've thought of is using a 3rd party tool to
give maybe more "accurate" values to QEMU.
For example, Kepler could be used to give a value for each thread
of QEMU, so that instead of calculating them using the qemu-vmsr-helper,
each value is transferred on request to QEMU via the UNIX socket that is
used today between the daemon and QEMU. It's just an idea that I have,
and I don't know if it is acceptable to each project (QEMU and
Kepler), but it could solve a few issues.

> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6 0/3] Add support for the RAPL MSRs series
  2024-10-22 14:16             ` Anthony Harivel
@ 2024-10-22 14:29               ` Daniel P. Berrangé
  2024-10-22 14:40                 ` Anthony Harivel
  2024-11-01 15:09               ` Igor Mammedov
  1 sibling, 1 reply; 25+ messages in thread
From: Daniel P. Berrangé @ 2024-10-22 14:29 UTC (permalink / raw)
  To: Anthony Harivel
  Cc: Igor Mammedov, pbonzini, mtosatti, qemu-devel, vchundur, rjarry

On Tue, Oct 22, 2024 at 04:16:36PM +0200, Anthony Harivel wrote:
> Daniel P. Berrangé, Oct 22, 2024 at 15:15:
> > On Tue, Oct 22, 2024 at 02:46:15PM +0200, Igor Mammedov wrote:
> >> On Fri, 18 Oct 2024 13:59:34 +0100
> >> Daniel P. Berrangé <berrange@redhat.com> wrote:
> >> 
> >> > On Fri, Oct 18, 2024 at 02:25:26PM +0200, Igor Mammedov wrote:
> >> > > On Wed, 16 Oct 2024 14:56:39 +0200
> >> > > "Anthony Harivel" <aharivel@redhat.com> wrote:
> >> [...]
> >> 
> >> > > 
> >> > > This also leads to a question, if we should account for
> >> > > not VCPU threads at all. Looking at real hardware, those
> >> > > MSRs return power usage of CPUs only, and they do not
> >> > > return consumption from auxiliary system components
> >> > > (io/memory/...). One can consider non VCPU threads in QEMU
> >> > > as auxiliary components, so we probably should not to
> >> > > account for them at all when modeling the same hw feature.
> >> > > (aka be consistent with what real hw does).  
> >> > 
> >> > I understand your POV, but I think that would be a mistake,
> >> > and would undermine the usefulness of the feature.
> >> > 
> >> > The deployment model has a cluster of hosts and guests, all
> >> > belonging to the same user. The user goal is to measure host
> >> > power consumption imposed by the guest, and dynamically adjust
> >> > guest workloads in order to minimize power consumption of the
> >> > host.
> >> 
> >> For cloud use-case, host side is likely in a better position
> >> to accomplish the task of saving power by migrating VM to
> >> another socket/host to compact idle load. (I've found at least 1
> >> kubernetis tool[1], which does energy monitoring). Perhaps there
> >> are schedulers out there that do that using its data.
> 
> I also work for Kepler project. I use it to monitor my VM has a black 
> box and I used it inside my VM with this feature enable. Thanks to that 
> I can optimize the workloads (dpdk application,database,..) inside my VM. 
> 
> This is the use-case in NFV deployment and I'm pretty sure this could be 
> the use-case of many others.
> 
> >
> > The host admin can merely shuffle workloads around, hoping that
> > a different packing of workloads onto machines, will reduce power
> > in some aount. You might win a few %, or low 10s of % with this
> > if you're good at it.
> >
> > The guest admin can change the way their workload operates to
> > reduce its inherant power consumption baseline. You could easily
> > come across ways to win high 10s of % with this. That's why it
> > is interesting to expose power consumption info to the guest
> > admin.
> >
> > IOW, neither makes the other obsolete, both approaches are
> > desirable.
> >
> >> > The guest workloads can impose non-negligble power consumption
> >> > loads on non-vCPU threads in QEMU. Without that accounted for,
> >> > any adjustments will be working from (sometimes very) inaccurate
> >> > data.
> >> 
> >> Perhaps adding one or several energy sensors (ex: some i2c ones),
> >> would let us provide auxiliary threads consumption to guest, and
> >> even make it more granular if necessary (incl. vhost user/out of
> >> process device models or pass-through devices if they have PMU).
> >> It would be better than further muddling vCPUs consumption
> >> estimates with something that doesn't belong there.
> 
> I'm confused about your statement. Like every software power metering 
> tools out is using RAPL (Kepler, Scaphandre, PowerMon, etc) and custom 
> sensors would be better than a what everyone is using ?
> The goal is not to be accurate. The goal is to be able to compare 
> A against B in the same environment and RAPL is given reproducible 
> values to do so.

Be careful with saying "The goal is not to be accurate", as that's
a very broad statement, and I don't think it is true.


If you're doing A/B comparisons, you *do* need accuracy, in the
sense that if a guest workload config change alters host CPU
power consumption, you want that to be reflected in what the
guest is told about its power usage.

ie if a change in B moves some power usage from a vCPU thread
to a non-vCPU thread, you don't want that power usage to
disappear from what's reported to the guest. It would give you
the false idea that B is more efficient than A, even if the
non-vCPU thread for B was consuming 2x what the original vCPU
thread was for A.

What I think you don't need is for the absolute magnitude of
the reported power consumption to be a precise match to the
actual power consumption.

ie if A and B are reported as 7 and 9 Watts respectively, it
doesn't matter if the actual consumption was 12 and 15 watts.

The relationship between the two measurements is still valid,
and enables tuning, despite the magnitude being under-reported.
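
To put numbers on this, here is a deliberately simplified model, not the
actual QEMU computation: the guest is attributed a share of the package
energy in proportion to the ticks its threads used. QemuTicks and
guest_energy are hypothetical names, and the tick counts are invented for
the example.

```c
#include <stdint.h>

/* Hypothetical per-window tick counts for one QEMU process. */
typedef struct {
    uint64_t vcpu_ticks;      /* ticks used by vCPU threads */
    uint64_t non_vcpu_ticks;  /* ticks used by other QEMU threads (I/O, ...) */
} QemuTicks;

/*
 * Share of a package energy delta attributed to the guest, proportional to
 * the ticks its threads used out of all ticks spent on the package.
 * include_non_vcpu selects whether non-vCPU threads are counted.
 */
double guest_energy(double pkg_joules, const QemuTicks *t,
                    uint64_t pkg_total_ticks, int include_non_vcpu)
{
    uint64_t ticks = t->vcpu_ticks;

    if (include_non_vcpu) {
        ticks += t->non_vcpu_ticks;
    }
    if (pkg_total_ticks == 0) {
        return 0.0;
    }
    return pkg_joules * (double)ticks / (double)pkg_total_ticks;
}
```

Take config A with 80 vCPU ticks and 10 non-vCPU ticks, and config B with
50 and 60, out of 200 package ticks and 100 J. Counting vCPU threads only,
A gets 40 J and B gets 25 J, so B looks more efficient. Counting all
threads, A gets 45 J and B gets 55 J, which reverses the conclusion.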

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6 0/3] Add support for the RAPL MSRs series
  2024-10-22 14:29               ` Daniel P. Berrangé
@ 2024-10-22 14:40                 ` Anthony Harivel
  0 siblings, 0 replies; 25+ messages in thread
From: Anthony Harivel @ 2024-10-22 14:40 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Igor Mammedov, pbonzini, mtosatti, qemu-devel, vchundur, rjarry

Daniel P. Berrangé, Oct 22, 2024 at 16:29:
> On Tue, Oct 22, 2024 at 04:16:36PM +0200, Anthony Harivel wrote:
>> Daniel P. Berrangé, Oct 22, 2024 at 15:15:
>> > On Tue, Oct 22, 2024 at 02:46:15PM +0200, Igor Mammedov wrote:
>> >> On Fri, 18 Oct 2024 13:59:34 +0100
>> >> Daniel P. Berrangé <berrange@redhat.com> wrote:
>> >> 
>> >> > On Fri, Oct 18, 2024 at 02:25:26PM +0200, Igor Mammedov wrote:
>> >> > > On Wed, 16 Oct 2024 14:56:39 +0200
>> >> > > "Anthony Harivel" <aharivel@redhat.com> wrote:
>> >> [...]
>> >> 
>> >> > > 
>> >> > > This also leads to a question, if we should account for
>> >> > > not VCPU threads at all. Looking at real hardware, those
>> >> > > MSRs return power usage of CPUs only, and they do not
>> >> > > return consumption from auxiliary system components
>> >> > > (io/memory/...). One can consider non VCPU threads in QEMU
>> >> > > as auxiliary components, so we probably should not to
>> >> > > account for them at all when modeling the same hw feature.
>> >> > > (aka be consistent with what real hw does).  
>> >> > 
>> >> > I understand your POV, but I think that would be a mistake,
>> >> > and would undermine the usefulness of the feature.
>> >> > 
>> >> > The deployment model has a cluster of hosts and guests, all
>> >> > belonging to the same user. The user goal is to measure host
>> >> > power consumption imposed by the guest, and dynamically adjust
>> >> > guest workloads in order to minimize power consumption of the
>> >> > host.
>> >> 
>> >> For cloud use-case, host side is likely in a better position
>> >> to accomplish the task of saving power by migrating VM to
>> >> another socket/host to compact idle load. (I've found at least 1
>> >> kubernetis tool[1], which does energy monitoring). Perhaps there
>> >> are schedulers out there that do that using its data.
>> 
>> I also work for Kepler project. I use it to monitor my VM has a black 
>> box and I used it inside my VM with this feature enable. Thanks to that 
>> I can optimize the workloads (dpdk application,database,..) inside my VM. 
>> 
>> This is the use-case in NFV deployment and I'm pretty sure this could be 
>> the use-case of many others.
>> 
>> >
>> > The host admin can merely shuffle workloads around, hoping that
>> > a different packing of workloads onto machines, will reduce power
>> > in some aount. You might win a few %, or low 10s of % with this
>> > if you're good at it.
>> >
>> > The guest admin can change the way their workload operates to
>> > reduce its inherant power consumption baseline. You could easily
>> > come across ways to win high 10s of % with this. That's why it
>> > is interesting to expose power consumption info to the guest
>> > admin.
>> >
>> > IOW, neither makes the other obsolete, both approaches are
>> > desirable.
>> >
>> >> > The guest workloads can impose non-negligble power consumption
>> >> > loads on non-vCPU threads in QEMU. Without that accounted for,
>> >> > any adjustments will be working from (sometimes very) inaccurate
>> >> > data.
>> >> 
>> >> Perhaps adding one or several energy sensors (ex: some i2c ones),
>> >> would let us provide auxiliary threads consumption to guest, and
>> >> even make it more granular if necessary (incl. vhost user/out of
>> >> process device models or pass-through devices if they have PMU).
>> >> It would be better than further muddling vCPUs consumption
>> >> estimates with something that doesn't belong there.
>> 
>> I'm confused by your statement. Every software power metering
>> tool out there uses RAPL (Kepler, Scaphandre, PowerMon, etc.), so why would
>> custom sensors be better than what everyone is already using?
>> The goal is not to be accurate. The goal is to be able to compare
>> A against B in the same environment, and RAPL gives reproducible
>> values to do so.
>
> Be careful with saying "The goal is not to be accurate", as that's
> a very broad statement, and I don't think it is true.
>
>
> If you're doing A/B comparisons, you *do* need accuracy, in the
> sense that if a guest workload config change alters host CPU
> power consumption, you want that to be reflected in what the
> guest is told about its power usage.
>
> ie if a change in B moves some power usage from a vCPU thread
> to a non-vCPU thread, you don't want that power usage to
> disappear from what's reported to the guest. It would give you
> the false idea that B is more efficient than A, even if the
> non-vCPU thread for B was consuming 2x what the original vCPU
> thread was for A.
>
> What I think you don't need is for the absolute magnitude of
> the reported power consumption to be a precise match to the
> actual power consumption.
>
> ie if A and B are reported as 7 and 9 Watts respectively, it
> doesn't matter if the actual consumption was 12 and 15 watts.
>

Right, my bad, I agree. When I said "not accurate", I was indeed talking
about the absolute magnitude of the reported power consumption.
Your example above is exactly what I had in mind.
Sorry for my clumsy shortcut, and thanks for clarifying this important point.
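
To make the distinction concrete, here is a minimal Python sketch of the
point discussed above. The per-thread wattages, the 0.6 scaling factor and
the reported() helper are all invented for illustration; nothing here
reflects how QEMU or qemu-vmsr-helper actually computes its values.

```python
# Hypothetical per-thread power draw in watts for two guest configurations.
# All numbers are invented for illustration; they are not measurements.
CONFIG_A = {"vcpu": 10.0, "non_vcpu": 2.0}   # actual total: 12 W
CONFIG_B = {"vcpu": 3.0, "non_vcpu": 12.0}   # actual total: 15 W


def reported(config, scale=1.0, include_non_vcpu=True):
    """Value the guest would be told under a given accounting policy.

    scale models an error in absolute magnitude; include_non_vcpu selects
    whether non-vCPU thread consumption is counted at all.
    """
    total = config["vcpu"]
    if include_non_vcpu:
        total += config["non_vcpu"]
    return scale * total


# Wrong absolute magnitude, but every thread is counted: A still looks
# cheaper than B, so the A/B comparison stays valid.
scaled_a = reported(CONFIG_A, scale=0.6)
scaled_b = reported(CONFIG_B, scale=0.6)
print(scaled_a < scaled_b)

# Correct magnitude, but non-vCPU threads dropped: B now looks cheaper
# than A although it actually draws more, so the comparison misleads.
vcpu_only_a = reported(CONFIG_A, include_non_vcpu=False)
vcpu_only_b = reported(CONFIG_B, include_non_vcpu=False)
print(vcpu_only_a > vcpu_only_b)
```

Both comparisons hold in this toy setup: a uniform scaling error preserves
the ordering of A and B, while leaving out the non-vCPU thread inverts it.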




* Re: [PATCH v6 0/3] Add support for the RAPL MSRs series
  2024-10-22 13:15           ` Daniel P. Berrangé
  2024-10-22 14:16             ` Anthony Harivel
@ 2024-10-22 15:35             ` Igor Mammedov
  1 sibling, 0 replies; 25+ messages in thread
From: Igor Mammedov @ 2024-10-22 15:35 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Anthony Harivel, pbonzini, mtosatti, qemu-devel, vchundur, rjarry

On Tue, 22 Oct 2024 14:15:44 +0100
Daniel P. Berrangé <berrange@redhat.com> wrote:

> On Tue, Oct 22, 2024 at 02:46:15PM +0200, Igor Mammedov wrote:
> > On Fri, 18 Oct 2024 13:59:34 +0100
> > Daniel P. Berrangé <berrange@redhat.com> wrote:
> >   
> > > On Fri, Oct 18, 2024 at 02:25:26PM +0200, Igor Mammedov wrote:  
> > > > On Wed, 16 Oct 2024 14:56:39 +0200
> > > > "Anthony Harivel" <aharivel@redhat.com> wrote:  
> > [...]
> >   
> > > > 
> > > > This also leads to a question, if we should account for
> > > > not VCPU threads at all. Looking at real hardware, those
> > > > MSRs return power usage of CPUs only, and they do not
> > > > return consumption from auxiliary system components
> > > > (io/memory/...). One can consider non VCPU threads in QEMU
> > > > as auxiliary components, so we probably should not
> > > > account for them at all when modeling the same hw feature.
> > > > (aka be consistent with what real hw does).    
> > > 
> > > I understand your POV, but I think that would be a mistake,
> > > and would undermine the usefulness of the feature.
> > > 
> > > The deployment model has a cluster of hosts and guests, all
> > > belonging to the same user. The user goal is to measure host
> > > power consumption imposed by the guest, and dynamically adjust
> > > guest workloads in order to minimize power consumption of the
> > > host.  
> > 
> > For cloud use-case, host side is likely in a better position
> > to accomplish the task of saving power by migrating VM to
> > another socket/host to compact idle load. (I've found at least 1
> > Kubernetes tool[1], which does energy monitoring). Perhaps there
> > are schedulers out there that do that using its data.  
> 
> The host admin can merely shuffle workloads around, hoping that
> a different packing of workloads onto machines, will reduce power
> in some amount. You might win a few %, or low 10s of % with this
> if you're good at it.

Package-level savings probably won't make much of a dent (older hw, less impact),
but if one thinks about vacating/powering down a host, it's a bit of a
different story (that was the case in my home lab: trying to
minimize idle consumption of 24/7 systems). But even with
that, when switching to newer hardware, it might eventually come to the
point of diminishing returns.

> The guest admin can change the way their workload operates to
> reduce its inherent power consumption baseline. You could easily
> come across ways to win high 10s of % with this. That's why it
> is interesting to expose power consumption info to the guest
> admin.

Looking at discussions around Intel's hybrid CPUs, I got
the impression that neither userspace nor the kernel has enough energy
consumption info to make decent scheduling decisions, and _no one
really wishes to do scheduling manually_ to begin with. That's where
Intel's CPUs with ITD (Thread Director) come into the picture, to help the
kernel somehow bin tasks based on efficiency figures (since the CPU knows
exactly how many resources it is using).
But that's relatively new, and whether such CPUs will stick or
not is still an open question (it makes sense for the mobile market,
but for other applications I'd guess time will tell).


> IOW, neither makes the other obsolete, both approaches are
> desirable.

no argument here.

> > > The guest workloads can impose non-negligible power consumption
> > > loads on non-vCPU threads in QEMU. Without that accounted for,
> > > any adjustments will be working from (sometimes very) inaccurate
> > > data.  
> > 
> > Perhaps adding one or several energy sensors (ex: some i2c ones),
> > would let us provide auxiliary threads consumption to guest, and
> > even make it more granular if necessary (incl. vhost user/out of
> > process device models or pass-through devices if they have PMU).
> > It would be better than further muddling vCPUs consumption
> > estimates with something that doesn't belong there.  
> 
> There's a tradeoff here in that info directly associated with
> backends threads, is effectively exposing private QEMU impl
> details as public ABI. IOW, we don't want too fine granularity
> here, we need it abstracted sufficiently, that different
> backend choices for a given don't change what sensors are
> exposed.
> 
> I also wonder how existing power monitoring applications
> would consume such custom sensors - is there sufficient
> standardization in this area that we're not inventing
> something totally QEMU specific ?

We can expose them as ACPI power meter devices, to make it
abstract for the guest OS (i.e. the guest would only need a standard
driver for it), or alternatively model some real i2c
sensors. But yes, it is something that should be explored so that
it works with/is supported by common tools or the tool of choice.

> 
> > > IOW, I think it is right to include non-vCPU threads usage in
> > > the reported info, as it is still fundamentally part of the
> > > load that the guest imposes on host pCPUs it is permitted to
> > > run on.  
> > 
> > 
> > From what I've read, process energy usage done via RAPL is not
> > exactly accurate. But there are monitoring tools out there that
> > use RAPL and other sources to make energy consumption monitoring
> > more reliable.
> > 
> > Reinventing that wheel and pulling all of the nuances of process
> > power monitoring inside of QEMU process, needlessly complicates it.
> > Maybe we should reuse one of existing tools and channel its data
> > through appropriate QEMU channels (RAPL/emulated PMU counters/...).  
> 
> Note, this feature is already released in QEMU 9.1.0.

That doesn't preclude us from improving impl. details
(i.e. what tasks QEMU does and what is up to the backend (external daemon))
though, incl. changing the backend if that would do a better job
in the end (with the benefit that it's mostly maintained by another project).

> > Implementing RAPL in pure form though looks fine to me,
> > so the same tools could use it the same way as on the host
> > if needed without VM specific quirks.  
> 
> IMHO the so called "pure" form is misleading to applications, unless
> we first provided some other practical way to expose the data that
> we would be throwing away from RAPL.
I'm not arguing that the data should be thrown away, just that we should
provide it some other way instead of via the vCPU RAPL interface, and not
confuse the host's pCPUs with vCPUs.

PS:
Take the example above, that aux threads are inherent pCPU load, and
stretch it to the host side: one can then say a pCPU inherently incurs
power draw on other system components with some workloads, so RAPL MSRs
should include that load as well.
But yep, at this point it turns into pointless bike-shedding.

PS2:
In a nutshell, my questions are:
 * should we expose aux threads as a separate power meter device?
 * would it be better to reuse/integrate with existing (hopefully mature)
   projects for monitoring on the host side, instead of duplicating a subset
   of their capabilities in a QEMU-specific helper and then maintaining it?

> With regards,
> Daniel




* Re: [PATCH v6 0/3] Add support for the RAPL MSRs series
  2024-10-22 14:16             ` Anthony Harivel
  2024-10-22 14:29               ` Daniel P. Berrangé
@ 2024-11-01 15:09               ` Igor Mammedov
  2024-11-02  9:32                 ` Anthony Harivel
  1 sibling, 1 reply; 25+ messages in thread
From: Igor Mammedov @ 2024-11-01 15:09 UTC (permalink / raw)
  To: Anthony Harivel
  Cc: Daniel P. Berrangé, pbonzini, mtosatti, qemu-devel, vchundur,
	rjarry, nathans, kenj, chorn, sunyanan.choochotkaew1,
	vibhu.sharma2929

On Tue, 22 Oct 2024 16:16:36 +0200
"Anthony Harivel" <aharivel@redhat.com> wrote:

> Daniel P. Berrangé, Oct 22, 2024 at 15:15:
> > On Tue, Oct 22, 2024 at 02:46:15PM +0200, Igor Mammedov wrote:  
> >> On Fri, 18 Oct 2024 13:59:34 +0100
> >> Daniel P. Berrangé <berrange@redhat.com> wrote:
> >>   
> >> > On Fri, Oct 18, 2024 at 02:25:26PM +0200, Igor Mammedov wrote:  
> >> > > On Wed, 16 Oct 2024 14:56:39 +0200
> >> > > "Anthony Harivel" <aharivel@redhat.com> wrote:  
> >> [...]
> >>   
> >> > > 
> >> > > This also leads to a question, if we should account for
> >> > > not VCPU threads at all. Looking at real hardware, those
> >> > > MSRs return power usage of CPUs only, and they do not
> >> > > return consumption from auxiliary system components
> >> > > (io/memory/...). One can consider non VCPU threads in QEMU
> >> > > as auxiliary components, so we probably should not
> >> > > account for them at all when modeling the same hw feature.
> >> > > (aka be consistent with what real hw does).    
> >> > 
> >> > I understand your POV, but I think that would be a mistake,
> >> > and would undermine the usefulness of the feature.
> >> > 
> >> > The deployment model has a cluster of hosts and guests, all
> >> > belonging to the same user. The user goal is to measure host
> >> > power consumption imposed by the guest, and dynamically adjust
> >> > guest workloads in order to minimize power consumption of the
> >> > host.  
> >> 
> >> For cloud use-case, host side is likely in a better position
> >> to accomplish the task of saving power by migrating VM to
> >> another socket/host to compact idle load. (I've found at least 1
> >> Kubernetes tool[1], which does energy monitoring). Perhaps there
> >> are schedulers out there that do that using its data.  
> 
> I also work on the Kepler project. I use it to monitor my VM as a black
> box, and I also use it inside my VM with this feature enabled. Thanks to
> that, I can optimize the workloads (DPDK application, database, ...)
> inside my VM.
> 
> This is the use-case in NFV deployment and I'm pretty sure this could be 
> the use-case of many others.
> 
> >
> > The host admin can merely shuffle workloads around, hoping that
> > a different packing of workloads onto machines, will reduce power
> > in some amount. You might win a few %, or low 10s of % with this
> > if you're good at it.
> >
> > The guest admin can change the way their workload operates to
> > reduce its inherent power consumption baseline. You could easily
> > come across ways to win high 10s of % with this. That's why it
> > is interesting to expose power consumption info to the guest
> > admin.
> >
> > IOW, neither makes the other obsolete, both approaches are
> > desirable.
> >  
> >> > The guest workloads can impose non-negligible power consumption
> >> > loads on non-vCPU threads in QEMU. Without that accounted for,
> >> > any adjustments will be working from (sometimes very) inaccurate
> >> > data.  
> >> 
> >> Perhaps adding one or several energy sensors (ex: some i2c ones),
> >> would let us provide auxiliary threads consumption to guest, and
> >> even make it more granular if necessary (incl. vhost user/out of
> >> process device models or pass-through devices if they have PMU).
> >> It would be better than further muddling vCPUs consumption
> >> estimates with something that doesn't belong there.  
> 
> I'm confused by your statement. Every software power metering
> tool out there uses RAPL (Kepler, Scaphandre, PowerMon, etc.), so why would
> custom sensors be better than what everyone is already using?

RAPL is used to measure CPU/DRAM/maybe GPU domains.
See my other reply to Daniel about RAPL + aux
 (https://www.mail-archive.com/qemu-devel@nongnu.org/msg1072593.html)
My point wrt RAPL is: the CPU domain on the host and inside the guest
should do the same thing, i.e. report only the package/core
consumption of the virtual CPU and nothing else (non-vCPU induced load
should not be included in the CPU domain).

For non-vCPU consumption, we should do the same as bare metal,
i.e. add power sensors where necessary. At a minimum we can add
a system power meter sensor, which could account for total
energy draw (and that can include not only QEMU aux threads,
but also other related processes (e.g. a process handling a DPDK NIC,
or another vhost-user backend)).
Individual per-device sensors (i.e. per NIC) are also a possibility in
the future, if we can find a suitable sensor on the host to derive
the guest value from.
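
As a rough illustration of the split proposed above, the following Python
sketch partitions per-source energy samples into a CPU domain (vCPU threads
only) and a system power meter (everything attributable to the guest). The
Sample type, the "vcpu"/"aux"/"external" labels and the joule values are
assumptions made up for this example, not anything QEMU exposes today.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    source: str    # hypothetical label, e.g. "vcpu0" or "vhost-user-net"
    kind: str      # "vcpu", "aux" (QEMU non-vCPU thread) or "external"
    joules: float  # energy attributed to this source over one interval


def split_domains(samples):
    """Return (cpu_domain, system_meter) energy in joules.

    cpu_domain counts vCPU threads only, mirroring what the RAPL CPU domain
    reports on bare metal; system_meter counts every source.
    """
    cpu_domain = sum(s.joules for s in samples if s.kind == "vcpu")
    system_meter = sum(s.joules for s in samples)
    return cpu_domain, system_meter


# Invented sample data for one measurement interval.
samples = [
    Sample("vcpu0", "vcpu", 4.0),
    Sample("vcpu1", "vcpu", 3.5),
    Sample("iothread0", "aux", 1.5),
    Sample("vhost-user-net", "external", 2.0),
]

cpu_domain, system_meter = split_domains(samples)
print(cpu_domain, system_meter)
```

With this layout the guest could still see the aux and external consumption
through the system meter, without it being folded into the CPU domain.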

[...]

> Adding RAPL inside the VM makes total sense because you can use tools that
> are already out on the market.
no disagreement here.

Given the topic is relatively new, the tooling mostly concentrates on
RAPL as the most widely available sensor. But some tools can pull energy
values from other sources, so we can surely teach them to pull values from
the sensor(s) we'd want to add to QEMU (e.g. for an easy start, borrow
sensor handling from lm_sensors). I'd pick the ACPI power meter as
a possible candidate, since it is guest-OS agnostic and
we can attach it to anything in the machine tree.

> > There's a tradeoff here in that info directly associated with
> > backends threads, is effectively exposing private QEMU impl
> > details as public ABI. IOW, we don't want too fine granularity
> > here, we need it abstracted sufficiently, that different
> > backend choices for a given don't change what sensors are
> > exposed.
> >
> > I also wonder how existing power monitoring applications
> > would consume such custom sensors - is there sufficient
> > standardization in this area that we're not inventing
> > something totally QEMU specific ?
> >  
> >> > IOW, I think it is right to include non-vCPU threads usage in
> >> > the reported info, as it is still fundamentally part of the
> >> > load that the guest imposes on host pCPUs it is permitted to
> >> > run on.  
> >> 
> >> 
> >> From what I've read, process energy usage done via RAPL is not
> >> exactly accurate. But there are monitoring tools out there that
> >> use RAPL and other sources to make energy consumption monitoring
> >> more reliable.
> >> 
> >> Reinventing that wheel and pulling all of the nuances of process
> >> power monitoring inside of QEMU process, needlessly complicates it.
> >> Maybe we should reuse one of existing tools and channel its data
> >> through appropriate QEMU channels (RAPL/emulated PMU counters/...).  
> >
> > Note, this feature is already released in QEMU 9.1.0.
> >  
> >> Implementing RAPL in pure form though looks fine to me,
> >> so the same tools could use it the same way as on the host
> >> if needed without VM specific quirks.  
> >
> > IMHO the so called "pure" form is misleading to applications, unless
> > we first provided some other practical way to expose the data that
> > we would be throwing away from RAPL.
> >  
> 
> The other possibility that I've thought of is using a 3rd party tool to
> give maybe more "accurate values" to QEMU.
> For example, Kepler could be used to give a value for each thread
> of QEMU, so instead of calculating them with the qemu-vmsr-helper,
> each value is transferred on request to QEMU via the UNIX socket that is
> used today between the daemon and QEMU. It's just an idea that I have,
> and I don't know if that is acceptable for each project (QEMU and
> Kepler), but it could solve a few issues.

From QEMU's point of view, it would be fine to get values from an external
process and just proxy them to the guest (preferably without any massaging).

Also, on the QEMU side, I'd suggest splitting the current monolithic
functionality into 2 parts: a frontend (KVM MSR interface for starters) and
a backend object (created with the -object CLI option) that will handle
communication with an external daemon. That way QEMU would be able to easily
change/add different frontend and backend options (ex: add a frontend for
RAPL with TCG accel, add a backend for Kepler or other project(s)
down the road). It would be good to make this split even for
qemu-vmsr-helper. If you are interested, I can guide you wrt
the QEMU side of the question.
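
The frontend/backend split suggested above could look roughly like the
following Python sketch. It is only a model of the idea: the class names
(EnergyBackend, StaticBackend, MsrFrontend), the read_energy_uj() method and
the "package" domain key are hypothetical, and a real implementation would be
a QOM object in C talking to a daemon rather than a dictionary lookup.

```python
from abc import ABC, abstractmethod


class EnergyBackend(ABC):
    """Hypothetical backend: obtains energy values from an external source."""

    @abstractmethod
    def read_energy_uj(self, domain: str) -> int:
        """Return the energy counter for a domain, in microjoules."""


class StaticBackend(EnergyBackend):
    """Stand-in for a helper-daemon or Kepler-style backend.

    It returns canned values so the sketch runs without any daemon.
    """

    def __init__(self, values):
        self._values = dict(values)

    def read_energy_uj(self, domain: str) -> int:
        return self._values[domain]


class MsrFrontend:
    """Hypothetical guest-facing side (e.g. a KVM MSR exit handler).

    It proxies backend values to the guest without massaging them.
    """

    def __init__(self, backend: EnergyBackend):
        self._backend = backend

    def rdmsr(self, domain: str) -> int:
        return self._backend.read_energy_uj(domain)


# The frontend does not care which backend is plugged in.
frontend = MsrFrontend(StaticBackend({"package": 1234}))
print(frontend.rdmsr("package"))

other = MsrFrontend(StaticBackend({"package": 5678}))
print(other.rdmsr("package"))
```

The point of the shape is that a new backend (another daemon or protocol) or
a new frontend (another accelerator) can be added without touching the other
side.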

PS:
As for other projects, we should probably ask if they are open to the idea.
They would definitely need some patches for per-thread accounting,
and maybe for some API to talk with external users (but the latter
might already exist, and it might be better for QEMU to adopt it; here a QEMU
backend object might help as a translator of the existing protocol to
QEMU-specific internals).
The point is that QEMU won't have to reinvent the wheel, and other projects
will get more exposure/user base.

On top of the projects you've already pointed out for possible
integration, I could add pmdadenki (CCed a few authors), which
some distros are shipping/using.

> > With regards,
> > Daniel
> > -- 
> > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|  
> 




* Re: [PATCH v6 0/3] Add support for the RAPL MSRs series
  2024-11-01 15:09               ` Igor Mammedov
@ 2024-11-02  9:32                 ` Anthony Harivel
  2024-11-04  9:49                   ` Igor Mammedov
  0 siblings, 1 reply; 25+ messages in thread
From: Anthony Harivel @ 2024-11-02  9:32 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Daniel P. Berrangé, pbonzini, mtosatti, qemu-devel, vchundur,
	rjarry, nathans, kenj, chorn, sunyanan.choochotkaew1,
	vibhu.sharma2929


Hi Igor,

Igor Mammedov, Nov 01, 2024 at 16:09:
> On Tue, 22 Oct 2024 16:16:36 +0200
> "Anthony Harivel" <aharivel@redhat.com> wrote:
>
>> Daniel P. Berrangé, Oct 22, 2024 at 15:15:
>> > On Tue, Oct 22, 2024 at 02:46:15PM +0200, Igor Mammedov wrote:  
>> >> On Fri, 18 Oct 2024 13:59:34 +0100
>> >> Daniel P. Berrangé <berrange@redhat.com> wrote:
>> >>   
>> >> > On Fri, Oct 18, 2024 at 02:25:26PM +0200, Igor Mammedov wrote:  
>> >> > > On Wed, 16 Oct 2024 14:56:39 +0200
>> >> > > "Anthony Harivel" <aharivel@redhat.com> wrote:  
>> >> [...]
>> >>   
>> >> > > 
>> >> > > This also leads to a question, if we should account for
>> >> > > not VCPU threads at all. Looking at real hardware, those
>> >> > > MSRs return power usage of CPUs only, and they do not
>> >> > > return consumption from auxiliary system components
>> >> > > (io/memory/...). One can consider non VCPU threads in QEMU
>> >> > > as auxiliary components, so we probably should not to
>> >> > > account for them at all when modeling the same hw feature.
>> >> > > (aka be consistent with what real hw does).    
>> >> > 
>> >> > I understand your POV, but I think that would be a mistake,
>> >> > and would undermine the usefulness of the feature.
>> >> > 
>> >> > The deployment model has a cluster of hosts and guests, all
>> >> > belonging to the same user. The user goal is to measure host
>> >> > power consumption imposed by the guest, and dynamically adjust
>> >> > guest workloads in order to minimize power consumption of the
>> >> > host.  
>> >> 
>> >> For cloud use-case, host side is likely in a better position
>> >> to accomplish the task of saving power by migrating VM to
>> >> another socket/host to compact idle load. (I've found at least 1
>> >> Kubernetes tool[1], which does energy monitoring). Perhaps there
>> >> are schedulers out there that do that using its data.  
>> 
>> I also work on the Kepler project. I use it to monitor my VM as a black
>> box, and I also use it inside my VM with this feature enabled. Thanks to
>> that, I can optimize the workloads (DPDK application, database, ...)
>> inside my VM.
>> 
>> This is the use-case in NFV deployment and I'm pretty sure this could be 
>> the use-case of many others.
>> 
>> >
>> > The host admin can merely shuffle workloads around, hoping that
>> > a different packing of workloads onto machines, will reduce power
>> > in some amount. You might win a few %, or low 10s of % with this
>> > if you're good at it.
>> >
>> > The guest admin can change the way their workload operates to
>> > reduce its inherent power consumption baseline. You could easily
>> > come across ways to win high 10s of % with this. That's why it
>> > is interesting to expose power consumption info to the guest
>> > admin.
>> >
>> > IOW, neither makes the other obsolete, both approaches are
>> > desirable.
>> >  
>> >> > The guest workloads can impose non-negligible power consumption
>> >> > loads on non-vCPU threads in QEMU. Without that accounted for,
>> >> > any adjustments will be working from (sometimes very) inaccurate
>> >> > data.  
>> >> 
>> >> Perhaps adding one or several energy sensors (ex: some i2c ones),
>> >> would let us provide auxiliary threads consumption to guest, and
>> >> even make it more granular if necessary (incl. vhost user/out of
>> >> process device models or pass-through devices if they have PMU).
>> >> It would be better than further muddling vCPUs consumption
>> >> estimates with something that doesn't belong there.  
>> 
>> I'm confused by your statement. Every software power metering
>> tool out there uses RAPL (Kepler, Scaphandre, PowerMon, etc.), so why would
>> custom sensors be better than what everyone is already using?
>
> RAPL is used to measure CPU/DRAM/maybe GPU domains.
> see my other reply to Daniel RAPL + aux
>  (https://www.mail-archive.com/qemu-devel@nongnu.org/msg1072593.html)
> My point wrt RAPL is: CPU domain on host and inside guest
> should be doing the same thing, i.e. report only package/core
> consumption of virtual CPU and nothing else (non vCPU induced load
> should not be included in CPU domain).
>
> For non vCPU consumption, we should do the same as bare-metal,
> i.e. add power sensors where necessary. As minimum we can add
> a system power meter sensor, which could account for total
> energy draw (and that can include not only QEMU aux threads,
> but also for other related processes (aka process handling dpdk NIC,
> or other vhost user backend)).
> Individual per device sensors also a possibility in the future
> (i.e. per NIC) if we can find a suitable sensor on host to derive
> guest value.
>
> [...]
>
>> Adding RAPL inside the VM makes total sense because you can use tools that
>> are already out on the market.
> no disagreement here.
>
> Given the topic is relatively new, the tooling mostly concentrates on
> RAPL as most available sensor. But some tools can pull energy values
> from other sources, we surely can teach them to pull values from
> a sensor(s) we'd want to add to QEMU (i.e. for an easy start borrow
> sensor handling from lm_sensors). I'd pick acpi power meter as
> a possible candidate for it is being guest OS agnostic and
> we can attach it to anything in machine tree.
>
>> > There's a tradeoff here in that info directly associated with
>> > backends threads, is effectively exposing private QEMU impl
>> > details as public ABI. IOW, we don't want too fine granularity
>> > here, we need it abstracted sufficiently, that different
>> > backend choices for a given don't change what sensors are
>> > exposed.
>> >
>> > I also wonder how existing power monitoring applications
>> > would consume such custom sensors - is there sufficient
>> > standardization in this area that we're not inventing
>> > something totally QEMU specific ?
>> >  
>> >> > IOW, I think it is right to include non-vCPU threads usage in
>> >> > the reported info, as it is still fundamentally part of the
>> >> > load that the guest imposes on host pCPUs it is permitted to
>> >> > run on.  
>> >> 
>> >> 
>> >> From what I've read, process energy usage done via RAPL is not
>> >> exactly accurate. But there are monitoring tools out there that
>> >> use RAPL and other sources to make energy consumption monitoring
>> >> more reliable.
>> >> 
>> >> Reinventing that wheel and pulling all of the nuances of process
>> >> power monitoring inside of QEMU process, needlessly complicates it.
>> >> Maybe we should reuse one of existing tools and channel its data
>> >> through appropriate QEMU channels (RAPL/emulated PMU counters/...).  
>> >
>> > Note, this feature is already released in QEMU 9.1.0.
>> >  
>> >> Implementing RAPL in pure form though looks fine to me,
>> >> so the same tools could use it the same way as on the host
>> >> if needed without VM specific quirks.  
>> >
>> > IMHO the so called "pure" form is misleading to applications, unless
>> > we first provided some other practical way to expose the data that
>> > we would be throwing away from RAPL.
>> >  
>> 
>> The other possibility that I've thought of is using a 3rd party tool to
>> give maybe more "accurate values" to QEMU.
>> For example, Kepler could be used to give a value for each thread
>> of QEMU, so instead of calculating them with the qemu-vmsr-helper,
>> each value is transferred on request to QEMU via the UNIX socket that is
>> used today between the daemon and QEMU. It's just an idea that I have,
>> and I don't know if that is acceptable for each project (QEMU and
>> Kepler), but it could solve a few issues.
>
> From QEMU point of view, it would be fine to get values from external
> process and just proxy them to guest (preferably without any massaging).
>
> Also on QEMU side, I'd suggest to split current monolith functionality
> in 2 parts: frontend (KVM MSR interface for starters) and backend object
> (created with -object CLI option) that will handle communication
> with an external daemon. That way QEMU would be able easily change/add
> different frontend and backend options (ex: add frontend for RAPL
> with TCG accel, add backend for Kepler or other project(s)
> down the road). (it would be good to make this split even for
> qemu-vmsr-helper). (if you are interested, I can guide you wrt
> QEMU side of the question).
>
> PS:
> As for other projects we probably should ask if they are open to an idea.
> They definitely would need some patches for per thread accounting,
> and maybe for some API to talk with external users (but the latter
> might exist and it might be better for QEMU to adopt it (here QEMU
> backend object might help as translator of existing protocol to
> QEMU specific internals).
> The point is QEMU won't have to reinvent wheel, and other projects
> will get more exposure/user-base.
>
> On top of the projects, you've already pointed out for possible
> integration with. I could add pmdadenki (CCed few authors) which
> some distros are shipping/using.
>

I think you have a fair amount of ideas and opinions on how to handle
RAPL in QEMU, and that's really good for improving the feature.

What I would really like is to have Paolo's opinion on all of that. When
I started working on the subject, I talked to him several times and we
agreed on the current implementation.

Not that I disagree with all you said, on the contrary, but the amount
of change is quite significant, and it would be very annoying if the results
of this work don't make it upstream because of X & Y.

Let's see if we get more opinions from the people in the loop as well.

Thanks for the feedback.

Anthony




* Re: [PATCH v6 0/3] Add support for the RAPL MSRs series
  2024-10-22 13:49       ` Anthony Harivel
@ 2024-11-04  9:40         ` Igor Mammedov
  0 siblings, 0 replies; 25+ messages in thread
From: Igor Mammedov @ 2024-11-04  9:40 UTC (permalink / raw)
  To: Anthony Harivel
  Cc: pbonzini, mtosatti, berrange, qemu-devel, vchundur, rjarry

On Tue, 22 Oct 2024 15:49:33 +0200
"Anthony Harivel" <aharivel@redhat.com> wrote:

> Igor Mammedov, Oct 18, 2024 at 14:25:
> > On Wed, 16 Oct 2024 14:56:39 +0200
> > "Anthony Harivel" <aharivel@redhat.com> wrote:
> >  
> >> Hi Igor,
> >> 
> >> Igor Mammedov, Oct 16, 2024 at 13:52:  
> >> > On Wed, 22 May 2024 17:34:49 +0200
> >> > Anthony Harivel <aharivel@redhat.com> wrote:
> >> >    
> >> >> Dear maintainers, 
> >> >> 
> >> >> First of all, thank you very much for your review of my patch 
> >> >> [1].    
> >> >
> >> > I've tried to play with this feature and have a few questions about it
> >> >    
> >> 
> >> Thanks for testing this new feature. 
> >>   
> >> >  1. trying to start with non accessible or not existent socket
> >> >         -accel kvm,rapl=on,rapl-helper-socket=/tmp/socket 
> >> >     I get:
> >> >       qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/socks: vmsr socket opening failed
> >> >       qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/socks: kvm : error RAPL feature requirement not met
> >> >     * is it possible to report actual OS error that happened during open/connect,
> >> >       instead of unhelpful 'socket opening failed'?
> >> >
> >> >       What I see in vmsr_open_socket() error is ignored
> >> >       and btw it's error leak as well
> >> >    
> >> 
> >> Shame you missed the 6 iterations of that patch, which lasted for a year.
> >> I would have changed that directly!
> >> Anyway, I take note of that comment and will send a modification.
> >>   
> >> >     * 2nd line shouldn't be there if the 1st error already present.
> >> >
> >> >  2.  getting periodic error on console where QEMU has been started
> >> >       # ./qemu-vmsr-helper -k /tmp/sock
> >> >      ./qemu-system-x86_64 -snapshot -m 4G -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock rhel90.img  -vnc :0 -cpu host
> >> >      and let it run
> >> >
> >> >       it appears rdmsr works (well, it returns some values at least)
> >> >       however there are recurring errors in qemu's stderr(or out)
> >> >       
> >> >       qemu-system-x86_64: Error opening /proc/2496093/task/2496109/stat
> >> >       qemu-system-x86_64: Error opening /proc/2496093/task/2496095/stat
> >> >
> >> >       My guess it's some temporary threads, that come and go, but still
> >> >       they shouldn't cause errors if it's normal operation.
> >> >    
> >> 
> >> There is a WIP patch that changes this into a tracepoint. Maybe you can
> >> SSH to the VM in the meantime?
> >
> > it's just idling VM that doesn't do anything, hence the question.  
> >  
> >>   
> >> >       Also on daemon side, I a few times got while guest was running:
> >> >         qemu-vmsr-helper: Failed to open /proc at /proc/2496026/task/2496044
> >> >         qemu-vmsr-helper: Requested TID not in peer PID: 2496026 2496044
> >> >       though I can't reproduce it reliably    
> >> 
> >> This could happen only when a vCPU thread ID has changed between the
> >> rdmsr call through the socket and the helper reading the MSR.
> >> No idea how a vCPU can change TID or shut down that fast.
> >
> > I guess it needs to be figured out to decide if it's safe to ignore (and not print error)
> > or if it's a genuine error/bug somewhere
> >  
> >> >  3. when starting daemon not as root, it starts 'fine' but later on complains
> >> >       qemu-vmsr-helper: Failed to open MSR file at /dev/cpu/0/msr
> >> >     perhaps it would be better to fail at start daemon if it doesn't have
> >> >     access to necessary files.
> >> >    
> >> 
> >> Right taking a note on that as well.
> >> 
> >>   
> >> >  4. in case #3, guest also fails to start with errors:
> >> >       qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock: can't read any virtual msr
> >> >       qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock: kvm : error RAPL feature requirement not met
> >> >      again line #2 is not useful and probably not needed (maybe make it tracepoint)
> >> >      and #1 is unhelpful - it would be better if it directed user to check qemu-vmsr-helper
> >> >    
> >> 
> >> I will try to see how to improve that part. 
> >> Thanks for your valuable feedback.
> >>   
> >> >  5. does AMD have similar MSRs that we could use to make this feature complete?
> >> >    
> >> 
> >> Yes, but the addresses are completely different. However, this is on my to-do 
> >> list. First I need way more feedback like yours before moving on to extending 
> >> this feature.  
> >
> > If adding AMD's MSRs is not difficult, then I'd make it priority.
> > This way users (and libvirt) won't have to deal with 2 different
> > feature-sets and decide when to allow this to be turned on depending on host.
> >  
> 
> QEMU needs to know whether it runs on an Intel or AMD machine in order to choose 
> which set of MSRs it must read. I have not checked how to achieve this yet, 
> but I will when I work on that.

If QEMU talks to the daemon in terms of power per package, etc., we won't care
which MSRs go over the wire between the daemon and QEMU. QEMU can then map
that to the relevant MSRs (based on the CPU model) internally.
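
A minimal sketch of that mapping, assuming the daemon reports energy in
joules per package. The function names and the vendor keys are hypothetical
and not part of the series; the MSR addresses are the ones defined in the
Linux kernel's msr-index.h.

```python
from enum import Enum


class Domain(Enum):
    PACKAGE = "package"
    CORE = "core"


# Energy-status MSR index per guest CPU vendor and RAPL domain.
ENERGY_MSRS = {
    "GenuineIntel": {
        Domain.PACKAGE: 0x611,       # MSR_PKG_ENERGY_STATUS
        Domain.CORE: 0x639,          # MSR_PP0_ENERGY_STATUS
    },
    "AuthenticAMD": {
        Domain.PACKAGE: 0xC001029B,  # MSR_AMD_PKG_ENERGY_STATUS
        Domain.CORE: 0xC001029A,     # MSR_AMD_CORE_ENERGY_STATUS
    },
}

# MSR holding the energy unit (ESU) field, per vendor.
UNIT_MSRS = {
    "GenuineIntel": 0x606,       # MSR_RAPL_POWER_UNIT
    "AuthenticAMD": 0xC0010299,  # MSR_AMD_RAPL_POWER_UNIT
}


def guest_msr_for(vendor: str, domain: Domain) -> int:
    """Pick the guest-visible MSR index for a vendor-neutral domain."""
    return ENERGY_MSRS[vendor][domain]


def joules_to_counter(joules: float, esu: int) -> int:
    """Convert joules to a 32-bit energy counter; one unit is 1/2**esu J."""
    return int(joules * (1 << esu)) & 0xFFFFFFFF


if __name__ == "__main__":
    print(hex(guest_msr_for("AuthenticAMD", Domain.PACKAGE)))
    print(joules_to_counter(1.0, 14))
```

With such a split the wire format stays vendor neutral, and only the last
step depends on the CPU model exposed to the guest.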

> 
> >>   
> >> >  6. What happens to power accounting if host constantly migrates
> >> >     vcpus between sockets, are values we are getting still correct/meaningful?
> >> >     Or do we need to pin vcpus to get 'accurate' values?
> >> >    
> >> 
> >> The ratio calculation takes into account which socket the vCPU has 
> >> just been scheduled on. But yes, the values are more 'accurate' when 
> >> the vCPU is pinned.  
> >
> > in worst case VCPUs might be moved between sockets many times during
> > sample window, can you explain how that is accounted for?
> >  
> 
> If a vCPU moves between sockets during the sample period, then it is 
> detected and not taken into account.
> 
> That said, if your system is bouncing vCPUs back and forth between sockets, 
> then you will experience a lot of cache misses, CPU cache thrashing, 
> context switches, increased memory latency (NUMA issues), etc. This 
> will lead to performance degradation and very poor VM performance. 
> Then you should probably fix it.

Yep, it's a bad config, but typical for an overcommit scenario.
If we can't get a correct measurement in this case,
then at least printing a warning once would be nice.

(It would be better to refuse starting without vCPU pinning,
but given that pinning isn't done by QEMU, I don't see a way to do that.) 
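
A minimal sketch of the "warn once" idea, assuming the sampling code already
knows which host socket a vCPU ran on at the start and at the end of a sample
window. The class and its fields are hypothetical and do not reflect the
actual QEMU code; dropping the sample is also an assumption here.

```python
import logging

log = logging.getLogger("vmsr-sketch")


class MigrationFilter:
    """Drop samples where a vCPU changed socket, warning only the first time."""

    def __init__(self):
        self.warned = False
        self.dropped = 0

    def keep(self, socket_before: int, socket_after: int) -> bool:
        """Return True if the sample can be used for accounting."""
        if socket_before == socket_after:
            return True
        self.dropped += 1
        if not self.warned:
            log.warning(
                "vCPU changed host socket during the sample window; "
                "energy values may be unreliable without vCPU pinning"
            )
            self.warned = True
        return False


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    flt = MigrationFilter()
    for before, after in [(0, 0), (0, 1), (1, 0)]:
        print(before, after, flt.keep(before, after))
```

Keeping a counter of dropped samples next to the one-time warning would also
give the documentation something concrete to point users at.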

> > Anyways, it would be better to have some numbers in doc that would
> > clarify what kind of accuracy we are talking about (and example
> > pinned vs unpinned), or whether unpinned case measures average
> > temperature of patients in hospital and we should recommend
> > to pin vcpus and everything else.
> >  
> 
> I totally understand that I can add more clarification to the 
> documentation that might be obvious for some but not for others, such as 
> the fact that isolating your VM properly will give better results. 
> 
> But I won't give any numbers. It doesn't make sense. Accuracy is not the 
> goal of this feature; it never was and it never will be. First of all, 
> because RAPL is not accurate for power monitoring. You want accuracy? 
> Use a power metering device. 
> You want a reproducible way to compare energy consumption between 
> A and B in order to optimize your software? You can use RAPL, and so 
> this feature, which shows good reproducible results.
> 
> > Also actual usecase examples for the feature should be mentioned
> > in the doc. So users could figure out when they need to enable
> > this feature (with attached accuracy numbers). Aka how this
> > new feature is good for end users and what they can do with it.
> >  
> 
> Got it. More documentation, use case, examples. 
> I will see what can be added to QEMU documentation.
> 
> 
> >> >  7. do we have to have a dedicated thread for polling data from daemon?
> >> >
> >> >     Can we fetch data from vcpu thread that have accessed msr
> >> >     (with some caching and rate limiting access to the daemon)?
> >> >    
> >> 
> >> This feature revolves around a thread. Please look at the 
> >> documentation if not already done:
> >> 
> >> https://www.qemu.org/docs/master/specs/rapl-msr.html#high-level-implementation
> >> 
> >> If we only fetch from vCPU thread, we won't have the consumption of the 
> >> non-vcpu thread. They are taken into account in the total.  
> >
> > one can collect the same data from vcpu thread as well,
> > the bonus part is that we don't have an extra thread
> > hanging around and doing work even if guest never asks
> > for those MSRs.
> >
> > This also leads to a question, if we should account for
> > not VCPU threads at all. Looking at real hardware, those
> > MSRs return power usage of CPUs only, and they do not
> > return consumption from auxiliary system components
> > (io/memory/...). One can consider non VCPU threads in QEMU
> > as auxiliary components, so we probably should not
> > account for them at all when modeling the same hw feature.
> > (aka be consistent with what real hw does).
> >  
> >> Thanks again for your feedback. 
> >> 
> >> Anthony
> >> 
> >>   
> >> >> In this version (v6), I have attempted to address all the problems 
> >> >> addressed by Daniel and Paolo during the last review. 
> >> >> 
> >> >> However, two open questions remain unanswered that would require the 
> >> >> attention of the x86 maintainers: 
> >> >> 
> >> >> 1)Should I move from -kvm to -cpu the rapl feature ? [2]
> >> >> 
> >> >> 2)Should I already rename to "rapl_vmsr_*" in order to anticipate the 
> >> >>   future TMPI architecture ? [end of 3] 
> >> >> 
> >> >> Thank you again for your continued guidance. 
> >> >> 
> >> >> v5 -> v6
> >> >> --------
> >> >> - Better error consistency in qio_channel_get_peerpid()
> >> >> - Memory leak g_strdup_printf/g_build_filename corrected
> >> >> - Renaming several struct with "vmsr_*" for better namespace
> >> >> - Renamed several struct with "guest_*" for better comprehension
> >> >> - Optimization suggested by Daniel
> >> >> - Crash problem solved [4]
> >> >> 
> >> >> v4 -> v5
> >> >> --------
> >> >> 
> >> >> - correct qio_channel_get_peerpid: return pid = -1 in case of error
> >> >> - Vmsr_helper: compile only for x86
> >> >> - Vmsr_helper: use qio_channel_read/write_all
> >> >> - Vmsr_helper: abandon user/group
> >> >> - Vmsr_energy.c: correct all error_report
> >> >> - Vmsr thread: compute default socket path only once
> >> >> - Vmsr thread: open socket only once
> >> >> - Pass relevant QEMU CI
> >> >> 
> >> >> v3 -> v4
> >> >> --------
> >> >> 
> >> >> - Correct memory leaks with AddressSanitizer  
> >> >> - Add sanity check for QEMU and qemu-vmsr-helper for checking if host is 
> >> >>   INTEL and if RAPL is activated.
> >> >> - Rename poor variables naming for easier comprehension
> >> >> - Move code that checks Host before creating the VMSR thread
> >> >> - Get rid of libnuma: create function that read sysfs for reading the 
> >> >>   Host topology instead
> >> >> 
> >> >> v2 -> v3
> >> >> --------
> >> >> 
> >> >> - Move all memory allocations from Clib to Glib
> >> >> - Compile on *BSD (working on Linux only)
> >> >> - No more limitation on the virtual package: each vCPU that belongs to 
> >> >>   the same virtual package is giving the same results like expected on 
> >> >>   a real CPU.
> >> >>   This has been tested topology like:
> >> >>      -smp 4,sockets=2
> >> >>      -smp 16,sockets=4,cores=2,threads=2
> >> >> 
> >> >> v1 -> v2
> >> >> --------
> >> >> 
> >> >> - To overcome CVE-2020-8694, a socket communication is created
> >> >>   to a privileged helper
> >> >> - Add the privileged helper (qemu-vmsr-helper)
> >> >> - Add SO_PEERCRED in qio channel socket
> >> >> 
> >> >> RFC -> v1
> >> >> ---------
> >> >> 
> >> >> - Add vmsr_* in front of all vmsr specific function
> >> >> - Change malloc()/calloc()... with all glib equivalent
> >> >> - Pre-allocate all dynamic memories when possible
> >> >> - Add a Documentation of implementation, limitation and usage
> >> >> 
> >> >> Best regards,
> >> >> Anthony
> >> >> 
> >> >> [1]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg01570.html
> >> >> [2]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg03947.html
> >> >> [3]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg02350.html
> >> >> [4]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg02481.html
> >> >> 
> >> >> Anthony Harivel (3):
> >> >>   qio: add support for SO_PEERCRED for socket channel
> >> >>   tools: build qemu-vmsr-helper
> >> >>   Add support for RAPL MSRs in KVM/Qemu
> >> >> 
> >> >>  accel/kvm/kvm-all.c                      |  27 ++
> >> >>  contrib/systemd/qemu-vmsr-helper.service |  15 +
> >> >>  contrib/systemd/qemu-vmsr-helper.socket  |   9 +
> >> >>  docs/specs/index.rst                     |   1 +
> >> >>  docs/specs/rapl-msr.rst                  | 155 +++++++
> >> >>  docs/tools/index.rst                     |   1 +
> >> >>  docs/tools/qemu-vmsr-helper.rst          |  89 ++++
> >> >>  include/io/channel.h                     |  21 +
> >> >>  include/sysemu/kvm_int.h                 |  32 ++
> >> >>  io/channel-socket.c                      |  28 ++
> >> >>  io/channel.c                             |  13 +
> >> >>  meson.build                              |   7 +
> >> >>  target/i386/cpu.h                        |   8 +
> >> >>  target/i386/kvm/kvm.c                    | 431 +++++++++++++++++-
> >> >>  target/i386/kvm/meson.build              |   1 +
> >> >>  target/i386/kvm/vmsr_energy.c            | 337 ++++++++++++++
> >> >>  target/i386/kvm/vmsr_energy.h            |  99 +++++
> >> >>  tools/i386/qemu-vmsr-helper.c            | 530 +++++++++++++++++++++++
> >> >>  tools/i386/rapl-msr-index.h              |  28 ++
> >> >>  19 files changed, 1831 insertions(+), 1 deletion(-)
> >> >>  create mode 100644 contrib/systemd/qemu-vmsr-helper.service
> >> >>  create mode 100644 contrib/systemd/qemu-vmsr-helper.socket
> >> >>  create mode 100644 docs/specs/rapl-msr.rst
> >> >>  create mode 100644 docs/tools/qemu-vmsr-helper.rst
> >> >>  create mode 100644 target/i386/kvm/vmsr_energy.c
> >> >>  create mode 100644 target/i386/kvm/vmsr_energy.h
> >> >>  create mode 100644 tools/i386/qemu-vmsr-helper.c
> >> >>  create mode 100644 tools/i386/rapl-msr-index.h
> >> >>     
> >>   
> 




* Re: [PATCH v6 0/3] Add support for the RAPL MSRs series
  2024-11-02  9:32                 ` Anthony Harivel
@ 2024-11-04  9:49                   ` Igor Mammedov
  2024-11-05  7:11                     ` Christian Horn
  0 siblings, 1 reply; 25+ messages in thread
From: Igor Mammedov @ 2024-11-04  9:49 UTC (permalink / raw)
  To: Anthony Harivel
  Cc: Daniel P. Berrangé, pbonzini, mtosatti, qemu-devel, vchundur,
	rjarry, nathans, kenj, chorn, sunyanan.choochotkaew1,
	vibhu.sharma2929

On Sat, 02 Nov 2024 10:32:17 +0100
"Anthony Harivel" <aharivel@redhat.com> wrote:

> Hi Igor,
> 
> Igor Mammedov, Nov 01, 2024 at 16:09:
> > On Tue, 22 Oct 2024 16:16:36 +0200
> > "Anthony Harivel" <aharivel@redhat.com> wrote:
> >  
> >> Daniel P. Berrangé, Oct 22, 2024 at 15:15:  
> >> > On Tue, Oct 22, 2024 at 02:46:15PM +0200, Igor Mammedov wrote:    
> >> >> On Fri, 18 Oct 2024 13:59:34 +0100
> >> >> Daniel P. Berrangé <berrange@redhat.com> wrote:
> >> >>     
> >> >> > On Fri, Oct 18, 2024 at 02:25:26PM +0200, Igor Mammedov wrote:    
> >> >> > > On Wed, 16 Oct 2024 14:56:39 +0200
> >> >> > > "Anthony Harivel" <aharivel@redhat.com> wrote:    
> >> >> [...]
> >> >>     
> >> >> > > 
> >> >> > > This also leads to a question, if we should account for
> >> >> > > not VCPU threads at all. Looking at real hardware, those
> >> >> > > MSRs return power usage of CPUs only, and they do not
> >> >> > > return consumption from auxiliary system components
> >> >> > > (io/memory/...). One can consider non VCPU threads in QEMU
> >> >> > > as auxiliary components, so we probably should not
> >> >> > > account for them at all when modeling the same hw feature.
> >> >> > > (aka be consistent with what real hw does).      
> >> >> > 
> >> >> > I understand your POV, but I think that would be a mistake,
> >> >> > and would undermine the usefulness of the feature.
> >> >> > 
> >> >> > The deployment model has a cluster of hosts and guests, all
> >> >> > belonging to the same user. The user goal is to measure host
> >> >> > power consumption imposed by the guest, and dynamically adjust
> >> >> > guest workloads in order to minimize power consumption of the
> >> >> > host.    
> >> >> 
> >> >> For cloud use-case, host side is likely in a better position
> >> >> to accomplish the task of saving power by migrating VM to
> >> >> another socket/host to compact idle load. (I've found at least 1
> >> >> Kubernetes tool[1], which does energy monitoring). Perhaps there
> >> >> are schedulers out there that do that using its data.    
> >> 
> >> I also work for the Kepler project. I use it to monitor my VM as a black 
> >> box, and I use it inside my VM with this feature enabled. Thanks to that 
> >> I can optimize the workloads (DPDK application, database, ...) inside my VM. 
> >> 
> >> This is the use-case in NFV deployment and I'm pretty sure this could be 
> >> the use-case of many others.
> >>   
> >> >
> >> > The host admin can merely shuffle workloads around, hoping that
> >> > a different packing of workloads onto machines, will reduce power
> >> > by some amount. You might win a few %, or low 10s of % with this
> >> > if you're good at it.
> >> >
> >> > The guest admin can change the way their workload operates to
> >> > reduce its inherent power consumption baseline. You could easily
> >> > come across ways to win high 10s of % with this. That's why it
> >> > is interesting to expose power consumption info to the guest
> >> > admin.
> >> >
> >> > IOW, neither makes the other obsolete, both approaches are
> >> > desirable.
> >> >    
> >> >> > The guest workloads can impose non-negligible power consumption
> >> >> > loads on non-vCPU threads in QEMU. Without that accounted for,
> >> >> > any adjustments will be working from (sometimes very) inaccurate
> >> >> > data.    
> >> >> 
> >> >> Perhaps adding one or several energy sensors (ex: some i2c ones),
> >> >> would let us provide auxiliary threads consumption to guest, and
> >> >> even make it more granular if necessary (incl. vhost user/out of
> >> >> process device models or pass-through devices if they have PMU).
> >> >> It would be better than further muddling vCPUs consumption
> >> >> estimates with something that doesn't belong there.    
> >> 
> >> I'm confused by your statement. Every software power metering 
> >> tool out there uses RAPL (Kepler, Scaphandre, PowerMon, etc.), and custom 
> >> sensors would be better than what everyone is using?  
> >
> > RAPL is used to measure CPU/DRAM/maybe GPU domains.
> > see my other reply to Daniel RAPL + aux
> >  (https://www.mail-archive.com/qemu-devel@nongnu.org/msg1072593.html)
> > My point wrt RAPL is: CPU domain on host and inside guest
> > should be doing the same thing, i.e. report only package/core
> > consumption of virtual CPU and nothing else (non vCPU induced load
> > should not be included in CPU domain).
> >
> > For non vCPU consumption, we should do the same as bare-metal,
> > i.e. add power sensors where necessary. As minimum we can add
> > a system power meter sensor, which could account for total
> > energy draw (and that can include not only QEMU aux threads,
> > but also for other related processes (aka process handling dpdk NIC,
> > or other vhost user backend)).
> > Individual per-device sensors are also a possibility in the future
> > (i.e. per NIC) if we can find a suitable sensor on the host to derive
> > the guest value.
> >
> > [...]
> >  
> >> Adding RAPL inside the VM makes total sense because you can use tools that 
> >> are already out in the market.  
> > no disagreement here.
> >
> > Given the topic is relatively new, the tooling mostly concentrates on
> > RAPL as most available sensor. But some tools can pull energy values
> > from other sources, we surely can teach them to pull values from
> > a sensor(s) we'd want to add to QEMU (i.e. for an easy start borrow
> > sensor handling from lm_sensors). I'd pick acpi power meter as
> > a possible candidate for it is being guest OS agnostic and
> > we can attach it to anything in machine tree.
> >  
> >> > There's a tradeoff here in that info directly associated with
> >> > backends threads, is effectively exposing private QEMU impl
> >> > details as public ABI. IOW, we don't want too fine granularity
> >> > here, we need it abstracted sufficiently, that different
> >> > backend choices for a given don't change what sensors are
> >> > exposed.
> >> >
> >> > I also wonder how existing power monitoring applications
> >> > would consume such custom sensors - is there sufficient
> >> > standardization in this area that we're not inventing
> >> > something totally QEMU specific ?
> >> >    
> >> >> > IOW, I think it is right to include non-vCPU threads usage in
> >> >> > the reported info, as it is still fundamentally part of the
> >> >> > load that the guest imposes on host pCPUs it is permitted to
> >> >> > run on.    
> >> >> 
> >> >> 
> >> >> From what I've read, process energy usage done via RAPL is not
> >> >> exactly accurate. But there are monitoring tools out there that
> >> >> use RAPL and other sources to make energy consumption monitoring
> >> >> more reliable.
> >> >> 
> >> >> Reinventing that wheel and pulling all of the nuances of process
> >> >> power monitoring inside of QEMU process, needlessly complicates it.
> >> >> Maybe we should reuse one of existing tools and channel its data
> >> >> through appropriate QEMU channels (RAPL/emulated PMU counters/...).    
> >> >
> >> > Note, this feature is already released in QEMU 9.1.0.
> >> >    
> >> >> Implementing RAPL in pure form though looks fine to me,
> >> >> so the same tools could use it the same way as on the host
> >> >> if needed without VM specific quirks.    
> >> >
> >> > IMHO the so called "pure" form is misleading to applications, unless
> >> > we first provided some other practical way to expose the data that
> >> > we would be throwing away from RAPL.
> >> >    
> >> 
> >> The other possibility that I've thought of is using a 3rd party tool to 
> >> give maybe more "accurate values" to QEMU. 
> >> For example, Kepler could be used to give a value for each thread 
> >> of QEMU, so instead of calculating and using the qemu-vmsr-helper, 
> >> each value is transferred on request by QEMU via the UNIX socket that is 
> >> used today between the daemon and QEMU. It's just an idea that I have, 
> >> and I don't know if it is acceptable for each project (QEMU and 
> >> Kepler), but it would really solve a few issues.  
> >
> > From QEMU point of view, it would be fine to get values from external
> > process and just proxy them to guest (preferably without any massaging).
> >
> > Also on QEMU side, I'd suggest to split current monolith functionality
> > in 2 parts: frontend (KVM MSR interface for starters) and backend object
> > (created with -object CLI option) that will handle communication
> > with an external daemon. That way QEMU would be able easily change/add
> > different frontend and backend options (ex: add frontend for RAPL
> > with TCG accel, add backend for Kepler or other project(s)
> > down the road). (it would be good to make this split even for
> > qemu-vmsr-helper). (if you are interested, I can guide you wrt
> > QEMU side of the question).
> >
> > PS:
> > As for other projects we probably should ask if they are open to an idea.
> > They definitely would need some patches for per thread accounting,
> > and maybe for some API to talk with external users (but the latter
> > might exist and it might be better for QEMU to adopt it (here QEMU
> > backend object might help as translator of existing protocol to
> > QEMU specific internals)).
> > The point is QEMU won't have to reinvent wheel, and other projects
> > will get more exposure/user-base.
> >
> > On top of the projects, you've already pointed out for possible
> > integration with. I could add pmdadenki (CCed few authors) which
> > some distros are shipping/using.
> >  
> 
> I think you have a fair amount of ideas and opinions on how to handle the 
> RAPL in QEMU and that's really good for improving the features. 
> 
> What I would really like is to have Paolo's opinions on all of that. When 
> I started working on the subject I talked to him several time and we 
> agreed on the current implementation. 
> 
> Not that I disagree with all you said, to the contrary, but the amount 
> of change is quite significant and it would be very annoying if results 
> of this work doesn't make upstream because of Y & X.

A split frontend/backend design is an established pattern in QEMU, so I'm not
suggesting anything revolutionary (the probability that anyone would object
to it is very low).

Sending an RFC can serve as a starting point for discussion.  
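
To illustrate the shape of such a split (not QEMU code, just a sketch in
Python with made-up class names): the frontend only knows how to answer a
guest rdmsr, the backend only knows where the raw counters come from, and
either side can be swapped independently.

```python
from abc import ABC, abstractmethod

MSR_PKG_ENERGY_STATUS = 0x611  # Intel package energy counter
MSR_PP0_ENERGY_STATUS = 0x639  # Intel core (PP0) energy counter


class EnergyBackend(ABC):
    """Hypothetical backend: obtains raw counters from some source."""

    @abstractmethod
    def read_counter(self, domain: str, package_id: int) -> int:
        ...


class StaticBackend(EnergyBackend):
    """Stand-in for a helper daemon connection; returns canned values."""

    def __init__(self, values):
        self._values = values

    def read_counter(self, domain, package_id):
        return self._values[(domain, package_id)]


class MsrFrontend:
    """Hypothetical frontend: answers guest rdmsr by proxying the backend."""

    _DOMAINS = {
        MSR_PKG_ENERGY_STATUS: "package",
        MSR_PP0_ENERGY_STATUS: "core",
    }

    def __init__(self, backend: EnergyBackend):
        self._backend = backend

    def rdmsr(self, index: int, package_id: int) -> int:
        domain = self._DOMAINS[index]
        # Pass the value through, truncated to the 32-bit counter width.
        return self._backend.read_counter(domain, package_id) & 0xFFFFFFFF


backend = StaticBackend({("package", 0): 123456, ("core", 0): 7890})
frontend = MsrFrontend(backend)

if __name__ == "__main__":
    print(hex(MSR_PKG_ENERGY_STATUS), frontend.rdmsr(MSR_PKG_ENERGY_STATUS, 0))
```

A backend for Kepler or another project would then be a second
EnergyBackend implementation, with no change to the frontend.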

> 
> Let's see if we have more opinions from the people in the loop as well.

yep, given that it would be better to reuse existing power monitoring
projects, it would be nice to hear some feedback from them. 

> 
> Thanks for feedback.
> 
> Anthony
> 




* Re: [PATCH v6 0/3] Add support for the RAPL MSRs series
  2024-11-04  9:49                   ` Igor Mammedov
@ 2024-11-05  7:11                     ` Christian Horn
  2024-11-05 12:19                       ` Igor Mammedov
  0 siblings, 1 reply; 25+ messages in thread
From: Christian Horn @ 2024-11-05  7:11 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Anthony Harivel, Daniel P. Berrangé, pbonzini, mtosatti,
	qemu-devel, vchundur, rjarry, nathans, kenj,
	sunyanan.choochotkaew1, vibhu.sharma2929

Hi all,

some thoughts:

- I vote for making the metrics available in the guest, as far as possible,
  the same way as on the host.  That allows cascading, and lets in-guest
  monitoring work like on bare metal.
- As a result, really just plain vCPU consumption would be made available
  in the guest as rapl-core.  If the host can at some point understand
  the guest's GPU or I/O consumption, better hand that in separately.
- Keep in mind that we will also need this for other architectures, 
  at least aarch64.  RAPL comes from x86; rather than extending that
  to also do I/O or such, we might aim at an interface which will also
  work for aarch64.
- The bigger scope will be to look at the consumption of multiple systems; for
  that we will eventually need to move the metrics to the network, moving
  away from MSRs or similar mechanisms.
- For reading the metrics in the guest, I was tempted to suggest PCP with
  pmda-denki to cover RAPL, but right now it just reads sysfs, not
  MSRs.  There is also pmda-lmsensors for further sensors offered on various
  systems, and pmda-openmetrics for covering anything appearing somewhere in
  sysfs as a number.
 

> > Not that I disagree with all you said, to the contrary, but the amount 
> > of change is quite significant and it would be very annoying if results 
> > of this work doesn't make upstream because of Y & X.
> 
> split frontend/backend design is established pattern in QEMU, so I'm not
> suggesting anything revolutionary (probability that anyone would object
> to it is very low).
> 
> sending an RFC can serve as a starting point for discussion.  

+1,
Christian



* Re: [PATCH v6 0/3] Add support for the RAPL MSRs series
  2024-11-05  7:11                     ` Christian Horn
@ 2024-11-05 12:19                       ` Igor Mammedov
  2024-11-06  3:14                         ` Christian Horn
  0 siblings, 1 reply; 25+ messages in thread
From: Igor Mammedov @ 2024-11-05 12:19 UTC (permalink / raw)
  To: Christian Horn
  Cc: chorn, Anthony Harivel, Daniel P. Berrangé, pbonzini,
	mtosatti, qemu-devel, vchundur, rjarry, nathans, kenj,
	sunyanan.choochotkaew1, vibhu.sharma2929

On Tue, 5 Nov 2024 08:11:14 +0100
Christian Horn <chorn@fluxcoil.net> wrote:

> Hi all,
> 
> some thoughts:
> 
> - I vote for making the metrics as much as possible in the guest available
>   as on the host.  Allows cascading, and having in-guest-monitoring working
>   like on bare metal.
> - As result, really just plain vCPU consumption would be made available
>   in the guest as rapl-core.  If the host can at some point understand
>   guests GPU, or I/O consumption, better hand that in separately.
> - Having in mind that we will also need this for other architectures, 
>   at least aarch64.  RAPL comes from x86, rather than extending that
>   to also do I/O or such, we might aim at an interface which will also
>   work for aarch64.

+1 to both points

> - Bigger scope will be to look at the consumption of multiple systems, for
>   that we will need to move the metrics to network eventually, changing
>   from MSR or such mechanisms.

That isn't in VM scope though, which is what this topic is about.
But yes, the same tools as on bare metal can collect data and send/aggregate
it elsewhere. The main point from the VM perspective is to act just like
bare-metal systems so the same monitoring tools can be reused. 

> - For reading the metrics in the guest, I was tempted to suggest PCP with
>   pmda-denki to cover RAPL, but it's right now just reading /sysfs, not
>   MSR's.  pmda-lmsensors for further sensors offered on various systems,
For the NFV use case, I was also eyeing pmda-denki.

How hard would it be to add MSR-based sampling to denki?
Could we borrow Anthony's MSR sampling from
qemu-vmsr-helper to reduce the amount of work needed?

Also, for guest per vCPU accounting, we would need per thread
accounting (which I haven't noticed from a quick look at denki).
So some effort would be needed to add it there.  

I didn't know about pmda-lmsensors. I guess we should be able to use
it out of the box with an 'acpi power meter' sensor, if QEMU were to provide one.
I've also seen that denki supports a battery power sensor; we could abuse that
and make QEMU provide it, but I'd rather add an 'acpi power meter' sensor
to denki (which to some degree intersects with the battery power sensor
functionality).

PS:
In this series Anthony uses a custom protocol to get data from
the privileged MSR helper to QEMU. Would that be acceptable?
Or is there a preferred way for PCP to do inter-process comms?

>   and pmda-openmetrics for covering anything appearing somewhere on
>   /sysfs as a number.

>  
> 
> > > Not that I disagree with all you said, to the contrary, but the amount 
> > > of change is quite significant and it would be very annoying if results 
> > > of this work doesn't make upstream because of Y & X.  
> > 
> > split frontend/backend design is established pattern in QEMU, so I'm not
> > suggesting anything revolutionary (probability that anyone would object
> > to it is very low).
> > 
> > sending an RFC can serve as a starting point for discussion.    
> 
> +1,
> Christian
> 


* Re: [PATCH v6 0/3] Add support for the RAPL MSRs series
  2024-11-05 12:19                       ` Igor Mammedov
@ 2024-11-06  3:14                         ` Christian Horn
  0 siblings, 0 replies; 25+ messages in thread
From: Christian Horn @ 2024-11-06  3:14 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Anthony Harivel, Daniel P. Berrangé, pbonzini, mtosatti,
	qemu-devel, vchundur, rjarry, nathans, kenj,
	sunyanan.choochotkaew1, vibhu.sharma2929

* Igor Mammedov wrote:
> On Tue, 5 Nov 2024 08:11:14 +0100
> Christian Horn <chorn@fluxcoil.net> wrote:
> 
> > - For reading the metrics in the guest, I was tempted to suggest PCP with
> >   pmda-denki to cover RAPL, but it's right now just reading /sysfs, not
> >   MSR's.  pmda-lmsensors for further sensors offered on various systems,
> For the NFV use case, I was also eyeing pmda-denki.
> 
> How hard would it be to add MSR-based sampling to denki?
> Could we borrow Anthony's MSR sampling from
> qemu-vmsr-helper to reduce the amount of work needed?

Should be possible.  For sysfs we also do a detection of domains, and
based on that register metrics and instances with pmcd.  For rapl-msr,
that could be done in a similar way, i.e. as denki.rapl-msr,
or by separating into denki.rapl.sysfs and denki.rapl.msr .

As for actually doing it, I'm not part of the engineering org but
support, so it's a spare time activity, when I get to it.  PCP
engineering has people on the project; a Jira would be a first step.
Direct pull requests to upstream are also a good start of course.  When
developing that, one would in cycles modify src/pmdas/denki/denki.c,
compile it, get pmcd to use the modified pmda-denki, and look at debug
output and metrics.

> Also, for guest per vCPU accounting, we would need per thread
> accounting (which I haven't noticed from a quick look at denki).
> So some effort would be needed to add it there.  

I think we have these metrics in pmcd already from pmda-linux, i.e. we
can see them with this:
# pmrep -1gU -t 5 -J 3 proc.hog.cpu [..]
[ 1] - proc.hog.cpu["083377 /usr/lib64/firefox/firefox"]
[ 2] - proc.hog.cpu["084634 /usr/lib64/firefox/firefox"]
[ 3] - proc.hog.cpu["085225 md5sum"]
         1         2         3
     0.001     0.003    16.304
=> Top 3 consumers, process 3 is heaviest.
This uses derived metrics, computed from other metrics, defined here:
$ cat /etc/pcp/derived/proc.conf
[..]
proc.hog.cpu = 100 * (rate(proc.psinfo.utime) + rate(proc.psinfo.stime)) / (kernel.all.uptime - proc.psinfo.start_time)
proc.hog.cpu(oneline) = average percentage CPU utilization of each process
[..]

I was brainstorming with Nathan about this in the past, but we did
not quickly get to something and lost track.  
Following the PCP approach, a client would query the required metrics
from pmcd (i.e. "process md5sum is right now using most cpu cycles"),
and together with "the overall VM or bare-metal-system consumes right
now 100W", one could attribute consumption to processes.  We might get
away with derived metrics as per above.  If the computation is not doable
with that, we might also use our own client code (i.e. C or Python) which
gets the metrics and computes the accounting per thread.
The last resort would be to collect the required process metrics in 
pmda-denki for computation there.
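
A rough sketch of what such client code could compute, assuming the CPU
shares and the overall power reading have already been fetched (the function
name is made up, and splitting purely by CPU share ignores idle and non-CPU
power, so this is only a first approximation):

```python
from typing import Dict


def attribute_power(total_watts: float,
                    cpu_share: Dict[str, float]) -> Dict[str, float]:
    """Split total_watts across processes in proportion to their CPU share."""
    total_share = sum(cpu_share.values())
    if total_share <= 0:
        # Nothing is running; attribute nothing rather than divide by zero.
        return {name: 0.0 for name in cpu_share}
    return {
        name: total_watts * share / total_share
        for name, share in cpu_share.items()
    }


if __name__ == "__main__":
    # CPU shares taken from the pmrep output above, 100W as the system total.
    shares = {
        "083377 firefox": 0.001,
        "084634 firefox": 0.003,
        "085225 md5sum": 16.304,
    }
    for name, watts in attribute_power(100.0, shares).items():
        print(f"{name}: {watts:.3f} W")
```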

We might want to take this one out of this thread and discuss it with
PCP upstream, i.e. pcp@groups.io .

> I didn't know about pmda-lmsensors, I guess we should be able to use
> it out of box with 'acpi power meter' sensor, if QEMU were to provide such.
> I've also seen denki supporting battery power sensor, we can abuse that
> and make QEMU provide that, but I'd rather add 'acpi power meter' sensor
> to denki (which to some degree intersects with battery power sensor
> functionality).

On this aarch64/Asahi macbook here, recent kernels made 
/sys/class/hwmon/hwmon1 available, and 'sensors' offers:
[chris@asahi sensors]$ sensors
[..]
Total System Power:       7.71 W
AC Input Power:           9.99 W
3.8 V Rail Power:         0.00 W
Heatpipe Power:           2.46 W
[..]

I'm still wondering how these fit into a picture like this one:
https://htmlpreview.github.io/?https://github.com/christianhorn/smallhelpers/blob/main/pmda-denki-handbook/denki.html#_hardware_requirements_new_version
So with these sensors, overall system consumption is also available
while AC powered; of course, only on that hardware right now.

> PS:
> In this series Anthony uses custom protocol to get data from
> privileged MSR helper to QEMU. Would it be acceptable?

The only request would be that this is implemented as "an optional
on-top source", so that it does not prevent MSR access on bare-metal
hosts which do not have it.  I guess that's given.  So then it's an
abstracted channel we provide into the guest.

> Or is there a preferred way for PCP to do inter-process comms?

Hm.. I thought this was used here to communicate between host and guest?

On the good side, if we get the per-thread-attribution done, we can
illustrate attribution up into guests with what mermaid calls sankey:
https://mermaid.js.org/syntax/sankey.html  :)

cheers,
-- 
Christian Horn
AMC Technical Account Manager, Red Hat K.K.
pgp fprint ADA6 C79C AF2E 973E 3F70  73C5 9373 49E7 347B 904F


end of thread, other threads:[~2024-11-06  7:41 UTC | newest]

Thread overview: 25+ messages
2024-05-22 15:34 [PATCH v6 0/3] Add support for the RAPL MSRs series Anthony Harivel
2024-05-22 15:34 ` [PATCH v6 1/3] qio: add support for SO_PEERCRED for socket channel Anthony Harivel
2024-05-22 15:34 ` [PATCH v6 2/3] tools: build qemu-vmsr-helper Anthony Harivel
2024-05-22 15:34 ` [PATCH v6 3/3] Add support for RAPL MSRs in KVM/Qemu Anthony Harivel
2024-10-16 12:17   ` Igor Mammedov
2024-10-16 13:04     ` Anthony Harivel
2024-06-26 14:34 ` [PATCH v6 0/3] Add support for the RAPL MSRs series Anthony Harivel
2024-10-16 11:52 ` Igor Mammedov
2024-10-16 12:56   ` Anthony Harivel
2024-10-18 12:25     ` Igor Mammedov
2024-10-18 12:59       ` Daniel P. Berrangé
2024-10-22 12:46         ` Igor Mammedov
2024-10-22 13:15           ` Daniel P. Berrangé
2024-10-22 14:16             ` Anthony Harivel
2024-10-22 14:29               ` Daniel P. Berrangé
2024-10-22 14:40                 ` Anthony Harivel
2024-11-01 15:09               ` Igor Mammedov
2024-11-02  9:32                 ` Anthony Harivel
2024-11-04  9:49                   ` Igor Mammedov
2024-11-05  7:11                     ` Christian Horn
2024-11-05 12:19                       ` Igor Mammedov
2024-11-06  3:14                         ` Christian Horn
2024-10-22 15:35             ` Igor Mammedov
2024-10-22 13:49       ` Anthony Harivel
2024-11-04  9:40         ` Igor Mammedov
