[PATCH v2 0/3] Add support for RAPL MSRs series

qemu-devel.nongnu.org archive mirror

* [PATCH v2 0/3] Add support for RAPL MSRs series
@ 2023-10-31 14:46 Anthony Harivel
  2023-10-31 14:46 ` [PATCH v2 1/3] qio: add support for SO_PEERCRED for socket channel Anthony Harivel
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Anthony Harivel @ 2023-10-31 14:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, mtosatti, berrange, Anthony Harivel

Hello,

This v2 patch series tries to overcome the issue raised by CVE-2020-8694
[1] when reading the RAPL MSRs to populate the virtual ones in a
KVM/QEMU virtual machine.

The solution proposed here is to create a helper daemon that runs as a
privileged process and communicates over a socket with the QEMU thread
that handles the ratio calculation of the energy counter.

First, it adds the SO_PEERCRED socket option to the QIO channel so that
the helper daemon can check the PID of the peer (QEMU) and validate the
TID carried in the message.

Then the daemon, called qemu-vmsr-helper, is added in the tools folder.
The daemon is very similar to qemu-pr-helper in terms of operation.
However, the communication protocol is simpler and requires only one
coroutine to handle the peer request. Only the RAPL MSRs are allowed to
be read via the helper.
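
As an illustration of the protocol described above, here is a minimal
sketch of one request/reply exchange. The layout (three 32-bit words in,
one 64-bit word out, native byte order) follows what vh_co_entry() in
patch 2/3 reads and writes; the struct name, the helper function names
and the in-process stand-in for the daemon are assumptions made for this
example only, not code from the series.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Request layout as read by vh_co_entry(): three native-endian words. */
struct vmsr_request {
    uint32_t msr;     /* MSR index, e.g. 0x611 for MSR_PKG_ENERGY_STATUS */
    uint32_t cpu_id;  /* host CPU whose MSR device is read */
    uint32_t tid;     /* thread ID of the requesting vCPU thread */
};

/* Write exactly len bytes; returns 0 on success, -1 on error. */
static int write_full(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes; returns 0 on success, -1 on error or early EOF. */
static int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;

    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0 && errno == EINTR) {
            continue;
        }
        if (n <= 0) {
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical stand-in for the helper: answer one request with 'value'. */
static int fake_helper_serve_one(int fd, uint64_t value,
                                 struct vmsr_request *seen)
{
    if (read_full(fd, seen, sizeof(*seen)) < 0) {
        return -1;
    }
    return write_full(fd, &value, sizeof(value));
}

/*
 * Round trip in a single process: queue the request, let the stand-in
 * answer it, then read the 64-bit reply. Returns 0 on success.
 */
static int vmsr_demo_roundtrip(uint64_t canned, uint64_t *reply)
{
    struct vmsr_request req = { 0x611, 0, (uint32_t)getpid() };
    struct vmsr_request seen;
    int sv[2];
    int ret = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    if (write_full(sv[0], &req, sizeof(req)) == 0 &&
        fake_helper_serve_one(sv[1], canned, &seen) == 0 &&
        seen.msr == req.msr && seen.tid == req.tid &&
        read_full(sv[0], reply, sizeof(*reply)) == 0) {
        ret = 0;
    }
    close(sv[0]);
    close(sv[1]);
    return ret;
}
```

Because the request is sent as raw memory, both ends must agree on
endianness and word size, which holds when QEMU and the helper run on
the same host.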

Finally, the last commit adds all the RAPL MSRs in KVM/QEMU as in v1
but, instead of reading the MSR directly via readmsr(), reads the
value through socket communication.

This is a follow-up to the v1 sent in mid-June [2].

v1 -> v2
--------

- To overcome CVE-2020-8694, a socket communication is created
  to a privileged helper

- Add the privileged helper (qemu-vmsr-helper)

- Add SO_PEERCRED in qio channel socket

RFC -> v1
---------

- Add vmsr_* in front of all vmsr-specific functions

- Replace malloc()/calloc()... with their glib equivalents

- Pre-allocate all dynamic memory when possible

- Add documentation of the implementation, limitations and usage

Regards,
Anthony

[1]: https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/running-average-power-limit-energy-reporting.html#cve-2020-8694
[2]: https://lore.kernel.org/qemu-devel/20230616140830.316655-1-aharivel@redhat.com/

Anthony Harivel (3):
  qio: add support for SO_PEERCRED for socket channel
  tools: build qemu-vmsr-helper
  Add support for RAPL MSRs in KVM/Qemu

 accel/kvm/kvm-all.c             |  26 ++
 docs/specs/index.rst            |   1 +
 docs/specs/rapl-msr.rst         | 131 +++++++++
 docs/tools/index.rst            |   1 +
 docs/tools/qemu-vmsr-helper.rst |  89 ++++++
 include/io/channel.h            |  20 ++
 include/sysemu/kvm_int.h        |  12 +
 io/channel-socket.c             |  17 ++
 io/channel.c                    |  12 +
 meson.build                     |   5 +
 target/i386/cpu.h               |   8 +
 target/i386/kvm/kvm.c           | 308 +++++++++++++++++++
 target/i386/kvm/meson.build     |   1 +
 target/i386/kvm/vmsr_energy.c   | 278 +++++++++++++++++
 target/i386/kvm/vmsr_energy.h   |  82 ++++++
 tools/i386/qemu-vmsr-helper.c   | 507 ++++++++++++++++++++++++++++++++
 tools/i386/rapl-msr-index.h     |  28 ++
 17 files changed, 1526 insertions(+)
 create mode 100644 docs/specs/rapl-msr.rst
 create mode 100644 docs/tools/qemu-vmsr-helper.rst
 create mode 100644 target/i386/kvm/vmsr_energy.c
 create mode 100644 target/i386/kvm/vmsr_energy.h
 create mode 100644 tools/i386/qemu-vmsr-helper.c
 create mode 100644 tools/i386/rapl-msr-index.h

-- 
2.41.0


* [PATCH v2 1/3] qio: add support for SO_PEERCRED for socket channel
  2023-10-31 14:46 [PATCH v2 0/3] Add support for RAPL MSRs series Anthony Harivel
@ 2023-10-31 14:46 ` Anthony Harivel
  2023-11-01 10:20   ` Daniel P. Berrangé
  2023-10-31 14:46 ` [PATCH v2 2/3] tools: build qemu-vmsr-helper Anthony Harivel
  2023-10-31 14:46 ` [PATCH v2 3/3] Add support for RAPL MSRs in KVM/Qemu Anthony Harivel
  2 siblings, 1 reply; 11+ messages in thread
From: Anthony Harivel @ 2023-10-31 14:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, mtosatti, berrange, Anthony Harivel

The function qio_channel_get_peercred() fills in the credentials of the
peer process connected to this socket.

This credentials structure is defined in <sys/socket.h> as follows:

struct ucred {
	pid_t pid;    /* Process ID of the sending process */
	uid_t uid;    /* User ID of the sending process */
	gid_t gid;    /* Group ID of the sending process */
};

The use of this function is possible only for connected AF_UNIX stream
sockets and for AF_UNIX stream and datagram socket pairs.
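
For readers unfamiliar with SO_PEERCRED, the following is a minimal
standalone sketch of the underlying getsockopt() call, outside of the
QIOChannel abstraction. The function names get_peer_cred() and
peer_is_self() are made up for this example. It assumes Linux with
glibc, where struct ucred is only visible with _GNU_SOURCE defined.

```c
#define _GNU_SOURCE   /* glibc exposes struct ucred only with this */
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Fill *cred with the peer's credentials; 0 on success, -1 on error. */
static int get_peer_cred(int fd, struct ucred *cred)
{
    socklen_t len = sizeof(*cred);

    if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, cred, &len) < 0) {
        return -1;
    }
    /* A short result would leave part of *cred unset. */
    return len == sizeof(*cred) ? 0 : -1;
}

/*
 * Demo on a socket pair: both ends were created by this process, so
 * the reported peer PID and UID are our own. Returns 1 if they match.
 */
static int peer_is_self(void)
{
    struct ucred cred;
    int sv[2];
    int ok;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return 0;
    }
    ok = get_peer_cred(sv[0], &cred) == 0 &&
         cred.pid == getpid() &&
         cred.uid == geteuid();
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

The credentials are those recorded when the socket was connected (or
when socketpair() was called), so they identify the process that opened
the connection, not necessarily the thread sending a given message.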

Signed-off-by: Anthony Harivel <aharivel@redhat.com>
---
 include/io/channel.h | 20 ++++++++++++++++++++
 io/channel-socket.c  | 17 +++++++++++++++++
 io/channel.c         | 12 ++++++++++++
 3 files changed, 49 insertions(+)

diff --git a/include/io/channel.h b/include/io/channel.h
index 5f9dbaab65b0..99c02d61c3d9 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -149,6 +149,9 @@ struct QIOChannelClass {
                                   void *opaque);
     int (*io_flush)(QIOChannel *ioc,
                     Error **errp);
+    void (*io_peercred)(QIOChannel *ioc,
+                        struct ucred *cred,
+                        Error **errp);
 };
 
 /* General I/O handling functions */
@@ -898,4 +901,21 @@ int coroutine_mixed_fn qio_channel_writev_full_all(QIOChannel *ioc,
 int qio_channel_flush(QIOChannel *ioc,
                       Error **errp);
 
+/**
+ * qio_channel_get_peercred:
+ * @ioc: the channel object
+ * @cred: pointer to ucred struct
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Returns the credentials of the peer process connected to this socket.
+ *
+ * The use of this function is possible only for connected
+ * AF_UNIX stream sockets and for AF_UNIX stream and datagram
+ * socket pairs.
+ *
+ */
+void qio_channel_get_peercred(QIOChannel *ioc,
+                              struct ucred *cred,
+                              Error **errp);
+
 #endif /* QIO_CHANNEL_H */
diff --git a/io/channel-socket.c b/io/channel-socket.c
index 02ffb51e9957..b8285eb8ae49 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -836,6 +836,22 @@ qio_channel_socket_set_cork(QIOChannel *ioc,
     socket_set_cork(sioc->fd, v);
 }
 
+static void
+qio_channel_socket_get_peercred(QIOChannel *ioc,
+                                struct ucred *cred,
+                                Error **errp)
+{
+    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+    socklen_t len = sizeof(struct ucred);
+    Error *err = NULL;
+
+    if (getsockopt(sioc->fd,
+               SOL_SOCKET, SO_PEERCRED,
+               cred, &len) == -1) {
+        error_setg_errno(&err, errno, "Unable to get peer credentials");
+        error_propagate(errp, err);
+    }
+}
 
 static int
 qio_channel_socket_close(QIOChannel *ioc,
@@ -933,6 +949,7 @@ static void qio_channel_socket_class_init(ObjectClass *klass,
 #ifdef QEMU_MSG_ZEROCOPY
     ioc_klass->io_flush = qio_channel_socket_flush;
 #endif
+    ioc_klass->io_peercred = qio_channel_socket_get_peercred;
 }
 
 static const TypeInfo qio_channel_socket_info = {
diff --git a/io/channel.c b/io/channel.c
index 86c5834510ff..6dccccba5242 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -490,6 +490,18 @@ void qio_channel_set_cork(QIOChannel *ioc,
     }
 }
 
+void qio_channel_get_peercred(QIOChannel *ioc,
+                              struct ucred *cred,
+                              Error **errp)
+{
+    QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+    if (!klass->io_peercred) {
+        error_setg(errp, "Channel does not support peer credentials");
+        return;
+    }
+    klass->io_peercred(ioc, cred, errp);
+}
 
 off_t qio_channel_io_seek(QIOChannel *ioc,
                           off_t offset,
-- 
2.41.0




* [PATCH v2 2/3] tools: build qemu-vmsr-helper
  2023-10-31 14:46 [PATCH v2 0/3] Add support for RAPL MSRs series Anthony Harivel
  2023-10-31 14:46 ` [PATCH v2 1/3] qio: add support for SO_PEERCRED for socket channel Anthony Harivel
@ 2023-10-31 14:46 ` Anthony Harivel
  2023-11-01 10:36   ` Daniel P. Berrangé
                     ` (2 more replies)
  2023-10-31 14:46 ` [PATCH v2 3/3] Add support for RAPL MSRs in KVM/Qemu Anthony Harivel
  2 siblings, 3 replies; 11+ messages in thread
From: Anthony Harivel @ 2023-10-31 14:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, mtosatti, berrange, Anthony Harivel

Introduce a privileged helper to access RAPL MSR.

The privileged helper tool, qemu-vmsr-helper, is designed to provide
virtual machines with the ability to read specific RAPL (Running Average
Power Limit) MSRs without requiring CAP_SYS_RAWIO privileges or relying
on external, out-of-tree patches.

The helper tool leverages Unix permissions and SO_PEERCRED socket
options to enforce access control, ensuring that only processes
explicitly requesting read access via readmsr() from a valid Thread ID
can access these MSRs.
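
The thread-ID validation mentioned above can be sketched as a lookup
under /proc, which is what is_tid_present() in this patch does. The
version below is a condensed illustration under the assumption that
/proc is mounted; the name tid_belongs_to_pid() is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Return true if thread 'tid' is listed under /proc/<pid>/task,
 * i.e. if it is a thread of process 'pid'.
 */
static bool tid_belongs_to_pid(pid_t pid, pid_t tid)
{
    char path[64];
    int n;

    if (pid <= 0 || tid <= 0) {
        return false;
    }
    n = snprintf(path, sizeof(path), "/proc/%d/task/%d",
                 (int)pid, (int)tid);
    if (n < 0 || (size_t)n >= sizeof(path)) {
        return false;
    }
    return access(path, F_OK) == 0;
}
```

On Linux the main thread's TID equals the PID, so
tid_belongs_to_pid(getpid(), getpid()) is expected to hold for any
running process. The check is inherently racy: a thread can exit between
the lookup and the use of the result.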

The list of RAPL MSRs that are allowed to be read by the helper tool is
defined in rapl-msr-index.h. This list corresponds to the RAPL MSRs that
will be supported in the next commit titled "Add support for RAPL MSRs
in KVM/QEMU."
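
A condensed sketch of how the allow-list and the MSR read fit together
is shown below. The MSR indices are those from rapl-msr-index.h in this
patch. The function read_allowed_msr() and its path parameter are
assumptions for illustration: the helper itself builds the
/dev/cpu/N/msr path from the CPU number, while taking the path as an
argument lets the sketch be exercised on an ordinary file.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Indices copied from tools/i386/rapl-msr-index.h in this patch. */
#define MSR_RAPL_POWER_UNIT     0x00000606
#define MSR_PKG_POWER_LIMIT     0x00000610
#define MSR_PKG_ENERGY_STATUS   0x00000611
#define MSR_PKG_POWER_INFO      0x00000614

/* Only the four RAPL MSRs above may be read. */
static bool msr_is_allowed(uint32_t reg)
{
    switch (reg) {
    case MSR_RAPL_POWER_UNIT:
    case MSR_PKG_POWER_LIMIT:
    case MSR_PKG_ENERGY_STATUS:
    case MSR_PKG_POWER_INFO:
        return true;
    default:
        return false;
    }
}

/*
 * Read the 8-byte value at offset 'reg' from 'path' (normally an MSR
 * device such as /dev/cpu/0/msr). Returns 0 on success, -1 if the
 * register is not allowed or the read fails.
 */
static int read_allowed_msr(const char *path, uint32_t reg, uint64_t *value)
{
    ssize_t n;
    int fd;

    if (!msr_is_allowed(reg)) {
        return -1;
    }
    fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    /* The MSR device uses the register index as the file offset. */
    n = pread(fd, value, sizeof(*value), (off_t)reg);
    close(fd);
    return n == (ssize_t)sizeof(*value) ? 0 : -1;
}
```

Unlike vmsr_read_msr() in the patch, this variant reports failure
through its return value, so a caller can tell a failed read apart from
a register that really contains 0.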

Signed-off-by: Anthony Harivel <aharivel@redhat.com>
---
 docs/tools/index.rst            |   1 +
 docs/tools/qemu-vmsr-helper.rst |  89 ++++++
 meson.build                     |   5 +
 tools/i386/qemu-vmsr-helper.c   | 507 ++++++++++++++++++++++++++++++++
 tools/i386/rapl-msr-index.h     |  28 ++
 5 files changed, 630 insertions(+)
 create mode 100644 docs/tools/qemu-vmsr-helper.rst
 create mode 100644 tools/i386/qemu-vmsr-helper.c
 create mode 100644 tools/i386/rapl-msr-index.h

diff --git a/docs/tools/index.rst b/docs/tools/index.rst
index 8e65ce0dfc7b..33ad438e86f6 100644
--- a/docs/tools/index.rst
+++ b/docs/tools/index.rst
@@ -16,3 +16,4 @@ command line utilities and other standalone programs.
    qemu-pr-helper
    qemu-trace-stap
    virtfs-proxy-helper
+   qemu-vmsr-helper
diff --git a/docs/tools/qemu-vmsr-helper.rst b/docs/tools/qemu-vmsr-helper.rst
new file mode 100644
index 000000000000..6ec87b49d962
--- /dev/null
+++ b/docs/tools/qemu-vmsr-helper.rst
@@ -0,0 +1,89 @@
+==================================
+QEMU virtual RAPL MSR helper
+==================================
+
+Synopsis
+--------
+
+**qemu-vmsr-helper** [*OPTION*]
+
+Description
+-----------
+
+Implements the virtual RAPL MSR helper for QEMU.
+
+Accessing the RAPL (Running Average Power Limit) MSR enables the RAPL powercap
+driver to advertise and monitor the power consumption or accumulated energy
+consumption of different power domains, such as CPU packages, DRAM, and other
+components when available.
+
+However, those registers are only accessible with privileged access
+(CAP_SYS_RAWIO). QEMU can use an external helper to access them.
+
+:program:`qemu-vmsr-helper` is that external helper; it creates a listener
+socket which will accept incoming connections for communication with QEMU.
+
+If you want to run VMs in a setup like this, this helper should be started as a
+system service, and you should read the QEMU manual section on "RAPL MSR
+support" to find out how to configure QEMU to connect to the socket created by
+:program:`qemu-vmsr-helper`.
+
+After connecting to the socket, :program:`qemu-vmsr-helper` can
+optionally drop root privileges, except for those capabilities that
+are needed for its operation.
+
+:program:`qemu-vmsr-helper` can also use the systemd socket activation
+protocol.  In this case, the systemd socket unit should specify a
+Unix stream socket, like this::
+
+    [Socket]
+    ListenStream=/var/run/qemu-vmsr-helper.sock
+
+Options
+-------
+
+.. program:: qemu-vmsr-helper
+
+.. option:: -d, --daemon
+
+  run in the background (and create a PID file)
+
+.. option:: -q, --quiet
+
+  decrease verbosity
+
+.. option:: -v, --verbose
+
+  increase verbosity
+
+.. option:: -f, --pidfile=PATH
+
+  PID file when running as a daemon. By default the PID file
+  is created in the system runtime state directory, for example
+  :file:`/var/run/qemu-vmsr-helper.pid`.
+
+.. option:: -k, --socket=PATH
+
+  path to the socket. By default the socket is created in
+  the system runtime state directory, for example
+  :file:`/var/run/qemu-vmsr-helper.sock`.
+
+.. option:: -T, --trace [[enable=]PATTERN][,events=FILE][,file=FILE]
+
+  .. include:: ../qemu-option-trace.rst.inc
+
+.. option:: -u, --user=USER
+
+  user to drop privileges to
+
+.. option:: -g, --group=GROUP
+
+  group to drop privileges to
+
+.. option:: -h, --help
+
+  Display a help message and exit.
+
+.. option:: -V, --version
+
+  Display version information and exit.
diff --git a/meson.build b/meson.build
index dcef8b1e7911..d30a7a09d46f 100644
--- a/meson.build
+++ b/meson.build
@@ -3950,6 +3950,11 @@ if have_tools
                dependencies: [authz, crypto, io, qom, qemuutil,
                               libcap_ng, mpathpersist],
                install: true)
+
+    executable('qemu-vmsr-helper', files('tools/i386/qemu-vmsr-helper.c'),
+               dependencies: [authz, crypto, io, qom, qemuutil,
+                              libcap_ng, mpathpersist],
+               install: true)
   endif
 
   if have_ivshmem
diff --git a/tools/i386/qemu-vmsr-helper.c b/tools/i386/qemu-vmsr-helper.c
new file mode 100644
index 000000000000..1d82a2753e44
--- /dev/null
+++ b/tools/i386/qemu-vmsr-helper.c
@@ -0,0 +1,507 @@
+/*
+ * Privileged RAPL MSR helper commands for QEMU
+ *
+ * Copyright (C) 2023 Red Hat, Inc. <aharivel@redhat.com>
+ *
+ * Author: Anthony Harivel <aharivel@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; under version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include <getopt.h>
+#include <stdbool.h>
+#include <sys/ioctl.h>
+#ifdef CONFIG_LIBCAP_NG
+#include <cap-ng.h>
+#endif
+#include <pwd.h>
+#include <grp.h>
+
+#include "qemu/help-texts.h"
+#include "qapi/error.h"
+#include "qemu/cutils.h"
+#include "qemu/main-loop.h"
+#include "qemu/module.h"
+#include "qemu/error-report.h"
+#include "qemu/config-file.h"
+#include "qemu-version.h"
+#include "qapi/error.h"
+#include "qemu/error-report.h"
+#include "qemu/log.h"
+#include "qemu/systemd.h"
+#include "qapi/util.h"
+#include "io/channel.h"
+#include "io/channel-socket.h"
+#include "trace/control.h"
+#include "qemu-version.h"
+#include "rapl-msr-index.h"
+
+#define MAX_PATH_LEN 50
+#define MAX_LINE_LEN 500
+#define PR_OUT_FIXED_PARAM_SIZE 24
+
+static char *socket_path;
+static char *pidfile;
+static enum { RUNNING, TERMINATE, TERMINATING } state;
+static QIOChannelSocket *server_ioc;
+static int server_watch;
+static int num_active_sockets = 1;
+static int noisy;
+static int verbose;
+
+#ifdef CONFIG_LIBCAP_NG
+static int uid = -1;
+static int gid = -1;
+#endif
+
+static void compute_default_paths(void)
+{
+    g_autofree char *state = qemu_get_local_state_dir();
+
+    socket_path = g_build_filename(state, "run", "qemu-vmsr-helper.sock", NULL);
+    pidfile = g_build_filename(state, "run", "qemu-vmsr-helper.pid", NULL);
+}
+
+/*
+ * Check that the TID requesting the MSR read
+ * belongs to the peer. It should be the TID of a vCPU.
+ */
+static bool is_tid_present(pid_t pid, pid_t tid)
+{
+    char pidStr[20];
+    char tidStr[20];
+
+    snprintf(pidStr, sizeof(pidStr), "%d", pid);
+    snprintf(tidStr, sizeof(tidStr), "%d", tid);
+
+    char pidPath[256];
+    char tidPath[256];
+
+    snprintf(pidPath, sizeof(pidPath), "/proc/%s", pidStr);
+    snprintf(tidPath, sizeof(tidPath), "/proc/%s/task/%s", pidStr, tidStr);
+
+    /* Check if the TID directory exists within the PID directory */
+    if (access(tidPath, F_OK) == 0) {
+        return true;
+    }
+
+    return false;
+}
+
+/*
+ * Only the RAPL MSR in target/i386/cpu.h are allowed
+ */
+static bool is_msr_allowed(uint32_t reg)
+{
+    switch (reg) {
+    case MSR_RAPL_POWER_UNIT:
+    case MSR_PKG_POWER_LIMIT:
+    case MSR_PKG_ENERGY_STATUS:
+    case MSR_PKG_POWER_INFO:
+        return true;
+    default:
+        return false;
+    }
+}
+
+static uint64_t vmsr_read_msr(uint32_t reg, unsigned int cpu_id)
+{
+    int fd;
+    uint64_t data;
+
+    char path[MAX_PATH_LEN];
+    snprintf(path, MAX_PATH_LEN, "/dev/cpu/%u/msr", cpu_id);
+
+    fd = open(path , O_RDONLY);
+    if (fd < 0) {
+        return 0;
+    }
+    if (pread(fd, &data, sizeof data, reg) != sizeof data) {
+        data = 0;
+    }
+
+    close(fd);
+    return data;
+}
+
+static void usage(const char *name)
+{
+    (printf) (
+"Usage: %s [OPTIONS] FILE\n"
+"Virtual RAPL MSR helper program for QEMU\n"
+"\n"
+"  -h, --help                display this help and exit\n"
+"  -V, --version             output version information and exit\n"
+"\n"
+"  -d, --daemon              run in the background\n"
+"  -f, --pidfile=PATH        PID file when running as a daemon\n"
+"                            (default '%s')\n"
+"  -k, --socket=PATH         path to the unix socket\n"
+"                            (default '%s')\n"
+"  -T, --trace [[enable=]<pattern>][,events=<file>][,file=<file>]\n"
+"                            specify tracing options\n"
+#ifdef CONFIG_LIBCAP_NG
+"  -u, --user=USER           user to drop privileges to\n"
+"  -g, --group=GROUP         group to drop privileges to\n"
+#endif
+"\n"
+QEMU_HELP_BOTTOM "\n"
+    , name, pidfile, socket_path);
+}
+
+static void version(const char *name)
+{
+    printf(
+"%s " QEMU_FULL_VERSION "\n"
+"Written by Anthony Harivel.\n"
+"\n"
+QEMU_COPYRIGHT "\n"
+"This is free software; see the source for copying conditions.  There is NO\n"
+"warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.\n"
+    , name);
+}
+
+typedef struct VMSRHelperClient {
+    QIOChannelSocket *ioc;
+    Coroutine *co;
+    int fd;
+} VMSRHelperClient;
+
+static void coroutine_fn vh_co_entry(void *opaque)
+{
+    VMSRHelperClient *client = opaque;
+    uint64_t vmsr;
+    uint32_t request[3];
+    struct ucred cred;
+    int r;
+    Error *local_err = NULL;
+
+    qio_channel_set_blocking(QIO_CHANNEL(client->ioc),
+                             false, NULL);
+
+    qio_channel_set_follow_coroutine_ctx(QIO_CHANNEL(client->ioc), true);
+
+    /*
+     * Check peer credentials
+     * Only QEMU PID/TID are allowed
+     */
+    qio_channel_get_peercred(QIO_CHANNEL(client->ioc), &cred, &local_err);
+
+    if (cred.pid == 0) {
+        if (local_err != NULL) {
+            error_report_err(local_err);
+        }
+        error_report("Failed to get peer credentials");
+        goto out;
+    }
+
+    /*
+     * Read the requested MSR
+     * Only RAPL MSR in rapl-msr-index.h is allowed
+     */
+    r = qio_channel_read_all(QIO_CHANNEL(client->ioc),
+                             (char *) &request, sizeof(request), NULL);
+    if (!is_msr_allowed(request[0]) || r < 0) {
+        error_report("Read request fail: %d, %d", request[0], request[1]);
+        goto out;
+    }
+
+    vmsr = vmsr_read_msr(request[0], request[1]);
+
+    if (!is_tid_present(cred.pid, request[2])) {
+        error_report("requested TID not in peer PID");
+        vmsr = 0;
+    }
+
+    r = qio_channel_write_all(QIO_CHANNEL(client->ioc),
+                         (char *) &vmsr, sizeof(vmsr), NULL);
+    if (r < 0) {
+        error_report("write vmsr failed");
+        goto out;
+    }
+
+out:
+    object_unref(OBJECT(client->ioc));
+    g_free(client);
+}
+
+static gboolean accept_client(QIOChannel *ioc,
+                              GIOCondition cond,
+                              gpointer opaque)
+{
+    QIOChannelSocket *cioc;
+    VMSRHelperClient *vmsrh;
+
+    cioc = qio_channel_socket_accept(QIO_CHANNEL_SOCKET(ioc),
+                                     NULL);
+    if (!cioc) {
+        return TRUE;
+    }
+
+    vmsrh = g_new(VMSRHelperClient, 1);
+    vmsrh->ioc = cioc;
+    vmsrh->fd = -1;
+    vmsrh->co = qemu_coroutine_create(vh_co_entry, vmsrh);
+    qemu_coroutine_enter(vmsrh->co);
+
+    return TRUE;
+}
+
+static void termsig_handler(int signum)
+{
+    qatomic_cmpxchg(&state, RUNNING, TERMINATE);
+    qemu_notify_event();
+}
+
+static void close_server_socket(void)
+{
+    assert(server_ioc);
+
+    g_source_remove(server_watch);
+    server_watch = -1;
+    object_unref(OBJECT(server_ioc));
+    num_active_sockets--;
+}
+
+#ifdef CONFIG_LIBCAP_NG
+static int drop_privileges(void)
+{
+    /* clear all capabilities */
+    capng_clear(CAPNG_SELECT_BOTH);
+
+    if (capng_update(CAPNG_ADD, CAPNG_EFFECTIVE | CAPNG_PERMITTED,
+                     CAP_SYS_RAWIO) < 0) {
+        return -1;
+    }
+
+    /*
+     * Change user/group id, retaining the capabilities.
+     * Because file descriptors are passed via SCM_RIGHTS,
+     * we don't need supplementary groups (and in fact the helper
+     * can run as "nobody").
+     */
+    if (capng_change_id(uid != -1 ? uid : getuid(),
+                        gid != -1 ? gid : getgid(),
+                        CAPNG_DROP_SUPP_GRP | CAPNG_CLEAR_BOUNDING)) {
+        return -1;
+    }
+
+    return 0;
+}
+#endif
+
+int main(int argc, char **argv)
+{
+    const char *sopt = "hVk:f:dT:u:g:vq";
+    struct option lopt[] = {
+        { "help", no_argument, NULL, 'h' },
+        { "version", no_argument, NULL, 'V' },
+        { "socket", required_argument, NULL, 'k' },
+        { "pidfile", required_argument, NULL, 'f' },
+        { "daemon", no_argument, NULL, 'd' },
+        { "trace", required_argument, NULL, 'T' },
+        { "user", required_argument, NULL, 'u' },
+        { "group", required_argument, NULL, 'g' },
+        { "verbose", no_argument, NULL, 'v' },
+        { "quiet", no_argument, NULL, 'q' },
+        { NULL, 0, NULL, 0 }
+    };
+    int opt_ind = 0;
+    int loglevel = 1;
+    int quiet = 0;
+    int ch;
+    Error *local_err = NULL;
+    bool daemonize = false;
+    bool pidfile_specified = false;
+    bool socket_path_specified = false;
+    unsigned socket_activation;
+
+    struct sigaction sa_sigterm;
+    memset(&sa_sigterm, 0, sizeof(sa_sigterm));
+    sa_sigterm.sa_handler = termsig_handler;
+    sigaction(SIGTERM, &sa_sigterm, NULL);
+    sigaction(SIGINT, &sa_sigterm, NULL);
+    sigaction(SIGHUP, &sa_sigterm, NULL);
+
+    signal(SIGPIPE, SIG_IGN);
+
+    error_init(argv[0]);
+    module_call_init(MODULE_INIT_TRACE);
+    module_call_init(MODULE_INIT_QOM);
+    qemu_add_opts(&qemu_trace_opts);
+    qemu_init_exec_dir(argv[0]);
+
+    compute_default_paths();
+
+    while ((ch = getopt_long(argc, argv, sopt, lopt, &opt_ind)) != -1) {
+        switch (ch) {
+        case 'k':
+            g_free(socket_path);
+            socket_path = g_strdup(optarg);
+            socket_path_specified = true;
+            if (socket_path[0] != '/') {
+                error_report("socket path must be absolute");
+                exit(EXIT_FAILURE);
+            }
+            break;
+        case 'f':
+            g_free(pidfile);
+            pidfile = g_strdup(optarg);
+            pidfile_specified = true;
+            break;
+#ifdef CONFIG_LIBCAP_NG
+        case 'u': {
+            unsigned long res;
+            struct passwd *userinfo = getpwnam(optarg);
+            if (userinfo) {
+                uid = userinfo->pw_uid;
+            } else if (qemu_strtoul(optarg, NULL, 10, &res) == 0 &&
+                       (uid_t)res == res) {
+                uid = res;
+            } else {
+                error_report("invalid user '%s'", optarg);
+                exit(EXIT_FAILURE);
+            }
+            break;
+        }
+        case 'g': {
+            unsigned long res;
+            struct group *groupinfo = getgrnam(optarg);
+            if (groupinfo) {
+                gid = groupinfo->gr_gid;
+            } else if (qemu_strtoul(optarg, NULL, 10, &res) == 0 &&
+                       (gid_t)res == res) {
+                gid = res;
+            } else {
+                error_report("invalid group '%s'", optarg);
+                exit(EXIT_FAILURE);
+            }
+            break;
+        }
+#else
+        case 'u':
+        case 'g':
+            error_report("-%c not supported by this %s", ch, argv[0]);
+            exit(1);
+#endif
+        case 'd':
+            daemonize = true;
+            break;
+        case 'q':
+            quiet = 1;
+            break;
+        case 'v':
+            ++loglevel;
+            break;
+        case 'T':
+            trace_opt_parse(optarg);
+            break;
+        case 'V':
+            version(argv[0]);
+            exit(EXIT_SUCCESS);
+            break;
+        case 'h':
+            usage(argv[0]);
+            exit(EXIT_SUCCESS);
+            break;
+        case '?':
+            error_report("Try `%s --help' for more information.", argv[0]);
+            exit(EXIT_FAILURE);
+        }
+    }
+
+    /* set verbosity */
+    noisy = !quiet && (loglevel >= 3);
+    verbose = quiet ? 0 : MIN(loglevel, 3);
+
+    if (!trace_init_backends()) {
+        exit(EXIT_FAILURE);
+    }
+    trace_init_file();
+    qemu_set_log(LOG_TRACE, &error_fatal);
+
+    socket_activation = check_socket_activation();
+    if (socket_activation == 0) {
+        SocketAddress saddr;
+        saddr = (SocketAddress){
+            .type = SOCKET_ADDRESS_TYPE_UNIX,
+            .u.q_unix.path = socket_path,
+        };
+        server_ioc = qio_channel_socket_new();
+        if (qio_channel_socket_listen_sync(server_ioc, &saddr,
+                                           1, &local_err) < 0) {
+            object_unref(OBJECT(server_ioc));
+            error_report_err(local_err);
+            return 1;
+        }
+    } else {
+        /* Using socket activation - check user didn't use -k etc. */
+        if (socket_path_specified) {
+            error_report("Unix socket can't be set when "
+                         "using socket activation");
+            exit(EXIT_FAILURE);
+        }
+
+        /* Can only listen on a single socket.  */
+        if (socket_activation > 1) {
+            error_report("%s does not support socket activation "
+                         "with LISTEN_FDS > 1",
+                         argv[0]);
+            exit(EXIT_FAILURE);
+        }
+        server_ioc = qio_channel_socket_new_fd(FIRST_SOCKET_ACTIVATION_FD,
+                                               &local_err);
+        if (server_ioc == NULL) {
+            error_reportf_err(local_err,
+                              "Failed to use socket activation: ");
+            exit(EXIT_FAILURE);
+        }
+    }
+
+    qemu_init_main_loop(&error_fatal);
+
+    server_watch = qio_channel_add_watch(QIO_CHANNEL(server_ioc),
+                                         G_IO_IN,
+                                         accept_client,
+                                         NULL, NULL);
+
+    if (daemonize) {
+        if (daemon(0, 0) < 0) {
+            error_report("Failed to daemonize: %s", strerror(errno));
+            exit(EXIT_FAILURE);
+        }
+    }
+
+    if (daemonize || pidfile_specified) {
+        qemu_write_pidfile(pidfile, &error_fatal);
+    }
+
+#ifdef CONFIG_LIBCAP_NG
+    if (drop_privileges() < 0) {
+        error_report("Failed to drop privileges: %s", strerror(errno));
+        exit(EXIT_FAILURE);
+    }
+#endif
+
+    state = RUNNING;
+    do {
+        main_loop_wait(false);
+        if (state == TERMINATE) {
+            state = TERMINATING;
+            close_server_socket();
+        }
+    } while (num_active_sockets > 0);
+
+    exit(EXIT_SUCCESS);
+}
diff --git a/tools/i386/rapl-msr-index.h b/tools/i386/rapl-msr-index.h
new file mode 100644
index 000000000000..9a7118639ae3
--- /dev/null
+++ b/tools/i386/rapl-msr-index.h
@@ -0,0 +1,28 @@
+/*
+ * Allowed list of MSR for Privileged RAPL MSR helper commands for QEMU
+ *
+ * Copyright (C) 2023 Red Hat, Inc. <aharivel@redhat.com>
+ *
+ * Author: Anthony Harivel <aharivel@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; under version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * Should stay in sync with the RAPL MSR
+ * in target/i386/cpu.h
+ */
+#define MSR_RAPL_POWER_UNIT             0x00000606
+#define MSR_PKG_POWER_LIMIT             0x00000610
+#define MSR_PKG_ENERGY_STATUS           0x00000611
+#define MSR_PKG_POWER_INFO              0x00000614
-- 
2.41.0




* [PATCH v2 3/3] Add support for RAPL MSRs in KVM/Qemu
  2023-10-31 14:46 [PATCH v2 0/3] Add support for RAPL MSRs series Anthony Harivel
  2023-10-31 14:46 ` [PATCH v2 1/3] qio: add support for SO_PEERCRED for socket channel Anthony Harivel
  2023-10-31 14:46 ` [PATCH v2 2/3] tools: build qemu-vmsr-helper Anthony Harivel
@ 2023-10-31 14:46 ` Anthony Harivel
  2023-11-01 11:24   ` Daniel P. Berrangé
  2023-11-01 14:33   ` Paolo Bonzini
  2 siblings, 2 replies; 11+ messages in thread
From: Anthony Harivel @ 2023-10-31 14:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, mtosatti, berrange, Anthony Harivel

Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL
interface (Running Average Power Limit) for advertising the accumulated
energy consumption of various power domains (e.g. CPU packages, DRAM,
etc.).

The consumption is reported via MSRs (model specific registers) like
MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are
64-bit registers that represent the accumulated energy consumption in
microjoules. They are updated by microcode every ~1ms.

For now, KVM always returns 0 when the guest requests the value of
these MSRs. Use the KVM MSR filtering mechanism to allow QEMU to handle
these MSRs dynamically in userspace.

To limit the number of system calls for every MSR call, create a new
thread in QEMU that updates the "virtual" MSR values asynchronously.

Each vCPU has its own vMSR to reflect the independence of vCPUs. The
thread updates each vMSR value from the energy consumed by the whole
physical CPU package the vCPU thread runs on, scaled by a ratio derived
from the thread's utime and stime values.

All other non-vCPU threads are also taken into account. Their energy
consumption is evenly distributed among all vCPU threads running on
the same physical CPU package.
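
To make the attribution described above more concrete, here is a rough
sketch of one way such a split could be computed for a single sampling
interval. It is not taken from vmsr_energy.c: the function name, the
parameters and the exact formula are assumptions for illustration. The
vCPU gets the package energy in proportion to its own CPU ticks
(utime + stime), plus an equal share of the energy attributed to
non-vCPU threads.

```c
#include <stdint.h>

/*
 * Hypothetical attribution for one sampling interval.
 *
 * pkg_delta:   energy consumed by the physical package in the interval
 * vcpu_ticks:  utime + stime ticks used by this vCPU thread
 * other_ticks: utime + stime ticks used by all non-vCPU threads
 * pkg_ticks:   total ticks available on the package in the interval
 * n_vcpus:     number of vCPU threads running on that package
 */
static uint64_t vcpu_energy_share(uint64_t pkg_delta,
                                  uint64_t vcpu_ticks,
                                  uint64_t other_ticks,
                                  uint64_t pkg_ticks,
                                  unsigned int n_vcpus)
{
    double own;
    double shared;

    if (pkg_ticks == 0 || n_vcpus == 0) {
        return 0;
    }
    /* Energy proportional to the vCPU thread's own CPU time. */
    own = (double)pkg_delta * (double)vcpu_ticks / (double)pkg_ticks;
    /* Energy of non-vCPU threads, split evenly among the vCPUs. */
    shared = (double)pkg_delta * (double)other_ticks / (double)pkg_ticks
             / (double)n_vcpus;
    return (uint64_t)(own + shared);
}
```

For example, with a package delta of 1000 units, a vCPU that used 25 of
100 ticks, and non-vCPU threads that used 10 ticks shared between two
vCPUs, this sketch attributes 250 + 50 = 300 units to the vCPU.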

To overcome the problem that reading the RAPL MSRs requires privileged
access, socket communication between QEMU and qemu-vmsr-helper is
mandatory. You can specify the socket path with the path parameter.

This feature is activated with -accel kvm,rapl=true,path=/path/sock.sock

Current limitations:
- Works only on Intel host CPUs because AMD CPUs use different MSR
  addresses.

- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at
  the moment.

- Since each vCPU has an independent vMSR value, the vCPU topology must
  be changed to match that reality. There must be a single vCPU per
  virtual socket (e.g.: -smp 4,sockets=4). Accessing pkg-0 energy will
  give vCPU 0 energy, pkg-1 will give vCPU 1 energy, etc.

Signed-off-by: Anthony Harivel <aharivel@redhat.com>

 accel/kvm/kvm-all.c           |  26 +++
 docs/specs/index.rst          |   1 +
 docs/specs/rapl-msr.rst       | 131 +++++++++++++++
 include/sysemu/kvm_int.h      |  12 ++
 target/i386/cpu.h             |   8 +
 target/i386/kvm/kvm.c         | 308 ++++++++++++++++++++++++++++++++++
 target/i386/kvm/meson.build   |   1 +
 target/i386/kvm/vmsr_energy.c | 278 ++++++++++++++++++++++++++++++
 target/i386/kvm/vmsr_energy.h |  82 +++++++++
 9 files changed, 847 insertions(+)
 create mode 100644 docs/specs/rapl-msr.rst
 create mode 100644 target/i386/kvm/vmsr_energy.c
 create mode 100644 target/i386/kvm/vmsr_energy.h

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 3f7eafe08cbe..e0df75932e8e 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3782,6 +3782,21 @@ static void kvm_set_dirty_ring_size(Object *obj, Visitor *v,
     s->kvm_dirty_ring_size = value;
 }
 
+static void kvm_set_kvm_rapl(Object *obj, bool value, Error **errp)
+{
+    KVMState *s = KVM_STATE(obj);
+    s->msr_energy.enable = value;
+}
+
+static void kvm_set_kvm_rapl_socket_path(Object *obj,
+                                         const char *str,
+                                         Error **errp)
+{
+    KVMState *s = KVM_STATE(obj);
+    g_free(s->msr_energy.socket_path);
+    s->msr_energy.socket_path = g_strdup(str);
+}
+
 static void kvm_accel_instance_init(Object *obj)
 {
     KVMState *s = KVM_STATE(obj);
@@ -3800,6 +3815,7 @@ static void kvm_accel_instance_init(Object *obj)
     s->xen_version = 0;
     s->xen_gnttab_max_frames = 64;
     s->xen_evtchn_max_pirq = 256;
+    s->msr_energy.enable = false;
 }
 
 /**
@@ -3840,6 +3856,16 @@ static void kvm_accel_class_init(ObjectClass *oc, void *data)
     object_class_property_set_description(oc, "dirty-ring-size",
         "Size of KVM dirty page ring buffer (default: 0, i.e. use bitmap)");
 
+    object_class_property_add_bool(oc, "rapl",
+                                   NULL,
+                                   kvm_set_kvm_rapl);
+    object_class_property_set_description(oc, "rapl",
+        "Allow energy related MSRs for RAPL interface in Guest");
+
+    object_class_property_add_str(oc, "path", NULL,
+                                  kvm_set_kvm_rapl_socket_path);
+    object_class_property_set_description(oc, "path",
+        "Socket Path for communicating with the Virtual MSR helper daemon");
     kvm_arch_accel_class_init(oc);
 }
 
diff --git a/docs/specs/index.rst b/docs/specs/index.rst
index e58be38c41c7..5c2fa3d65877 100644
--- a/docs/specs/index.rst
+++ b/docs/specs/index.rst
@@ -24,3 +24,4 @@ guest hardware that is specific to QEMU.
    acpi_erst
    sev-guest-firmware
    fw_cfg
+   rapl-msr
diff --git a/docs/specs/rapl-msr.rst b/docs/specs/rapl-msr.rst
new file mode 100644
index 000000000000..ec62a8206337
--- /dev/null
+++ b/docs/specs/rapl-msr.rst
@@ -0,0 +1,131 @@
+================
+RAPL MSR support
+================
+
+The RAPL interface (Running Average Power Limit) advertises the accumulated
+energy consumption of various power domains (e.g. CPU packages, DRAM, etc.).
+
+The consumption is reported via MSRs (model specific registers) like
+MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are 64-bit
+registers that represent the accumulated energy consumption in microjoules.
+
+Thanks to the MSR Filtering patch [#a]_ not all MSRs are handled by KVM. Some
+of them can now be handled by the userspace (QEMU). It uses a mechanism called
+"MSR filtering" where a list of MSRs is given at init time of a VM to KVM so
+that a callback is put in place. The design of this patch uses only this
+mechanism for handling the MSRs between guest/host.
+
+At the moment the following MSRs are involved:
+
+.. code:: C
+
+    #define MSR_RAPL_POWER_UNIT             0x00000606
+    #define MSR_PKG_POWER_LIMIT             0x00000610
+    #define MSR_PKG_ENERGY_STATUS           0x00000611
+    #define MSR_PKG_POWER_INFO              0x00000614
+
+The ``*_POWER_UNIT``, ``*_POWER_LIMIT`` and ``*_POWER_INFO`` MSRs are part of
+the RAPL spec. They specify the power limit of the package, provide the range
+of its parameters (min power, max power, ...) and give the multiplier needed
+to interpret the energy counter. Those MSRs are populated once at the
+beginning by reading the host CPU MSRs and are given back to the guest 1:1
+when requested.
+
+The MSR_PKG_ENERGY_STATUS is a counter; it represents the total amount of
+energy consumed since the last time the register was cleared. If you multiply
+it by the UNIT provided above you'll get the energy in microjoules. This
+counter is always increasing, and it increases faster or slower depending on
+the consumption of the package. This counter is expected to overflow at some
+point.
+
+Each core belonging to the same package reading the MSR_PKG_ENERGY_STATUS (i.e.
+"rdmsr 0x611") will retrieve the same value. The value represents the energy
+for the whole package. Whichever core reads it will get the same value, and a
+core that belongs to PKG-0 will not be able to get the value of PKG-1 and
+vice-versa.
+
+High level implementation
+-------------------------
+
+In order to update the value of the virtual MSR, a QEMU thread is created.
+The thread is basically just an infinite loop that does:
+
+1. Snapshot of the time metrics of all QEMU threads (Time spent scheduled in
+   Userspace and System)
+
+2. Snapshot of the actual MSR_PKG_ENERGY_STATUS counter of all packages where
+   the QEMU threads are running on.
+
+3. Sleep for 1 second - During this pause the vcpu and other non-vcpu threads
+   will do what they have to do and so the energy counter will increase.
+
+4. Repeat 1. and 2. and calculate the delta of every metric representing the
+   time spent scheduled for each QEMU thread *and* the energy spent by the
+   packages during the pause.
+
+5. Filter the vcpu threads and the non-vcpu threads.
+
+6. The total energy spent by the non-vcpu threads is divided by the number
+   of vcpu threads so that each vcpu thread will get an equal part of the
+   energy spent by the QEMU workers.
+
+7. Calculate the ratio of energy spent per vcpu thread.
+
+8. The virtual MSRs are updated for each vcpu by adding the energy it spent
+   plus its share of the non-vcpu energy.
+
+9. Loop back to 1.
+
+Ratio calculation
+-----------------
+
+In Linux, a process has an execution time associated with it. The scheduler
+divides time into clock ticks. The number of clock ticks per second can be
+obtained with sysconf(_SC_CLK_TCK). A typical value of clock ticks per second
+is 100. So a core can run a process for at most 100 ticks per second. If a
+package has 4 cores, 400 ticks maximum can be scheduled on all the cores
+of the package for a period of 1 second.
+
+The /proc/[pid]/stat [#b]_ is a procfs file that can give the executed time of
+a process with the [pid] as the process ID. It gives the number of ticks the
+process has been scheduled in userspace (utime) and kernel space (stime).
+
+By reading those metrics for a thread, one can calculate the ratio of time the
+package has spent executing the thread.
+
+Example:
+
+A 4-core package can schedule a maximum of 400 ticks per second with 100 ticks
+per second per core. If a thread was scheduled for 100 ticks during one second
+on this package, that means the thread has been scheduled for 1/4 of the whole
+package. With that, the calculation of the energy spent by the thread on this
+package during this whole second is 1/4 of the total energy spent by the
+package.
+
+Usage
+-----
+
+This feature is activated with -accel kvm,rapl=true,path=/path/sock.sock
+
+It is important that the socket path is the same as the one
+:program:`qemu-vmsr-helper` is listening on.
+
+Current Limitations
+-------------------
+
+- Works only on Intel host CPUs because AMD CPUs are using different MSR
+  addresses.
+
+- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at the
+  moment.
+
+- Since each vCPU has an independent vMSR value, the vCPU topology must be
+  changed to match that reality. There must be a single vCPU per virtual socket
+  (e.g.: -smp 4,sockets=4). Accessing pkg-0 energy will give vCPU 0 energy,
+  pkg-1 will give vCPU 1 energy, etc.
+
+References
+----------
+
+.. [#a] https://patchwork.kernel.org/project/kvm/patch/20200916202951.23760-7-graf@amazon.com/
+.. [#b] https://man7.org/linux/man-pages/man5/proc.5.html
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 075939a3c4f5..2913510d1215 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -48,6 +48,17 @@ typedef struct KVMMemoryListener {
 
 #define KVM_MSI_HASHTAB_SIZE    256
 
+struct KVMMsrEnergy {
+    bool enable;
+    char *socket_path;
+    QemuThread msr_thr;
+    int cpus;
+    uint64_t *msr_value;
+    uint64_t msr_unit;
+    uint64_t msr_limit;
+    uint64_t msr_info;
+};
+
 enum KVMDirtyRingReaperState {
     KVM_DIRTY_RING_REAPER_NONE = 0,
     /* The reaper is sleeping */
@@ -119,6 +130,7 @@ struct KVMState
     bool kvm_dirty_ring_with_bitmap;
     uint64_t kvm_eager_split_size;  /* Eager Page Splitting chunk size */
     struct KVMDirtyRingReaper reaper;
+    struct KVMMsrEnergy msr_energy;
     NotifyVmexitOption notify_vmexit;
     uint32_t notify_window;
     uint32_t xen_version;
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 471e71dbc5eb..2662a55c8255 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -396,6 +396,10 @@ typedef enum X86Seg {
 #define MSR_IA32_TSX_CTRL		0x122
 #define MSR_IA32_TSCDEADLINE            0x6e0
 #define MSR_IA32_PKRS                   0x6e1
+#define MSR_RAPL_POWER_UNIT             0x00000606
+#define MSR_PKG_POWER_LIMIT             0x00000610
+#define MSR_PKG_ENERGY_STATUS           0x00000611
+#define MSR_PKG_POWER_INFO              0x00000614
 #define MSR_ARCH_LBR_CTL                0x000014ce
 #define MSR_ARCH_LBR_DEPTH              0x000014cf
 #define MSR_ARCH_LBR_FROM_0             0x00001500
@@ -1787,6 +1791,10 @@ typedef struct CPUArchState {
 
     uintptr_t retaddr;
 
+    /* RAPL MSR */
+    uint64_t msr_rapl_power_unit;
+    uint64_t msr_pkg_energy_status;
+
     /* Fields up to this point are cleared by a CPU reset */
     struct {} end_reset_fields;
 
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index e7c054cc160b..6dd8576b6ef6 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -16,11 +16,16 @@
 #include "qapi/qapi-events-run-state.h"
 #include "qapi/error.h"
 #include "qapi/visitor.h"
+#include <math.h>
+#include <stdint.h>
 #include <sys/ioctl.h>
 #include <sys/utsname.h>
 #include <sys/syscall.h>
+#include <sys/resource.h>
+#include <sys/time.h>
 
 #include <linux/kvm.h>
+#include <unistd.h>
 #include "standard-headers/asm-x86/kvm_para.h"
 #include "hw/xen/interface/arch-x86/cpuid.h"
 
@@ -35,6 +40,7 @@
 #include "xen-emu.h"
 #include "hyperv.h"
 #include "hyperv-proto.h"
+#include "vmsr_energy.h"
 
 #include "exec/gdbstub.h"
 #include "qemu/host-utils.h"
@@ -2524,6 +2530,49 @@ static bool kvm_rdmsr_core_thread_count(X86CPU *cpu, uint32_t msr,
     return true;
 }
 
+static bool kvm_rdmsr_rapl_power_unit(X86CPU *cpu, uint32_t msr,
+                                        uint64_t *val)
+{
+
+    CPUState *cs = CPU(cpu);
+
+    *val = cs->kvm_state->msr_energy.msr_unit;
+
+    return true;
+}
+
+static bool kvm_rdmsr_pkg_power_limit(X86CPU *cpu, uint32_t msr,
+                                        uint64_t *val)
+{
+
+    CPUState *cs = CPU(cpu);
+
+    *val = cs->kvm_state->msr_energy.msr_limit;
+
+    return true;
+}
+
+static bool kvm_rdmsr_pkg_power_info(X86CPU *cpu, uint32_t msr,
+                                        uint64_t *val)
+{
+
+    CPUState *cs = CPU(cpu);
+
+    *val = cs->kvm_state->msr_energy.msr_info;
+
+    return true;
+}
+
+static bool kvm_rdmsr_pkg_energy_status(X86CPU *cpu, uint32_t msr,
+    uint64_t *val)
+{
+
+    CPUState *cs = CPU(cpu);
+    *val = cs->kvm_state->msr_energy.msr_value[cs->cpu_index];
+
+    return true;
+}
+
 static Notifier smram_machine_done;
 static KVMMemoryListener smram_listener;
 static AddressSpace smram_address_space;
@@ -2558,6 +2607,225 @@ static void register_smram_listener(Notifier *n, void *unused)
                                  &smram_address_space, 1, "kvm-smram");
 }
 
+static void *kvm_msr_energy_thread(void *data)
+{
+    KVMState *s = data;
+    struct KVMMsrEnergy *vmsr = &s->msr_energy;
+    unsigned int maxpkgs, maxcpus, maxticks;
+    package_energy_stat *pkg_stat;
+    int num_threads, tmp_num_threads = 0;
+    thread_stat *thd_stat;
+    CPUState *cpu;
+    pid_t pid, *thread_ids;
+
+    rcu_register_thread();
+
+    /* Get QEMU PID*/
+    pid = getpid();
+
+    /* Assuming those values are the same across physical system/packages */
+    /* Nb of CPUS per packages */
+    maxcpus = vmsr_get_maxcpus(0);
+    /* Nb of Physical Packages on the system */
+    maxpkgs = vmsr_get_max_physical_package(maxcpus);
+
+    if (maxpkgs == 0) {
+        return NULL;
+    }
+
+    /* Those MSR values should not change as well */
+    vmsr->msr_unit  = vmsr_read_msr(MSR_RAPL_POWER_UNIT, 0, pid,
+                                    s->msr_energy.socket_path);
+    vmsr->msr_limit = vmsr_read_msr(MSR_PKG_POWER_LIMIT, 0, pid,
+                                    s->msr_energy.socket_path);
+    vmsr->msr_info  = vmsr_read_msr(MSR_PKG_POWER_INFO, 0, pid,
+                                    s->msr_energy.socket_path);
+
+    /* Allocate memory for each package energy status */
+    pkg_stat = (package_energy_stat *)
+        g_malloc0(maxpkgs * sizeof(package_energy_stat));
+
+    /* Pre-allocate memory for thread stats */
+    thd_stat = g_new0(thread_stat, 1);
+
+    /*
+     * Max numbers of ticks per package
+     * time in second * number of ticks/second * Number of cores / package
+     * ex: for 100 ticks/second/CPU, 12 CPUs per Package gives 1200 ticks max
+     */
+    maxticks = (MSR_ENERGY_THREAD_SLEEP_US / 1000000)
+                    * sysconf(_SC_CLK_TCK) * maxcpus;
+
+    while (true) {
+
+        /* Get all qemu threads id */
+        thread_ids = vmsr_get_thread_ids(pid, &num_threads);
+
+        if (thread_ids == NULL) {
+            goto clean;
+        }
+
+        if (tmp_num_threads < num_threads) {
+
+            void *tmp_ptr;
+
+            tmp_ptr = g_realloc(thd_stat, num_threads * sizeof(thread_stat));
+            thd_stat = (thread_stat *) tmp_ptr;
+        }
+
+        tmp_num_threads = num_threads;
+
+        /* Populate all the thread stats */
+        for (int i = 0; i < num_threads; i++) {
+            thd_stat[i].utime = g_new0(unsigned long long, 2);
+            thd_stat[i].stime = g_new0(unsigned long long, 2);
+            thd_stat[i].thread_id = thread_ids[i];
+            vmsr_read_thread_stat(&thd_stat[i], pid, 0);
+            thd_stat[i].numa_node_id = numa_node_of_cpu(thd_stat[i].cpu_id);
+        }
+
+        /* Retrieve all packages power plane energy counter */
+        for (int i = 0; i < maxpkgs; i++) {
+            for (int j = 0; j < num_threads; j++) {
+                /*
+                 * Use the first thread we found that ran on the CPU
+                 * of the package to read the packages energy counter
+                 */
+                if (thd_stat[j].numa_node_id == i) {
+                    pkg_stat[i].e_start =
+                    vmsr_read_msr(MSR_PKG_ENERGY_STATUS, i, pid,
+                                  s->msr_energy.socket_path);
+                    break;
+                }
+            }
+        }
+
+        /* Sleep a short period while the other threads are working */
+        usleep(MSR_ENERGY_THREAD_SLEEP_US);
+
+        /*
+         * Retrieve all packages power plane energy counter
+         * Calculate the delta of all packages
+         */
+        for (int i = 0; i < maxpkgs; i++) {
+            for (int j = 0; j < num_threads; j++) {
+                /*
+                 * Use the first thread we found that ran on the CPU
+                 * of the package to read the packages energy counter
+                 */
+                if (thd_stat[j].numa_node_id == i) {
+                    pkg_stat[i].e_end =
+                       vmsr_read_msr(MSR_PKG_ENERGY_STATUS,
+                                     thd_stat[j].cpu_id,
+                                     thd_stat[j].thread_id,
+                                     s->msr_energy.socket_path);
+                    pkg_stat[i].e_delta =
+                        pkg_stat[i].e_end - pkg_stat[i].e_start;
+                    break;
+                }
+            }
+        }
+
+        /* Delta of ticks spent by each thread between the samples */
+        for (int i = 0; i < num_threads; i++) {
+            if (vmsr_read_thread_stat(&thd_stat[i], pid, 1) != 0) {
+                /*
+                 * We don't count the dead thread
+                 * i.e threads that existed before the sleep
+                 * and not anymore
+                 */
+                thd_stat[i].delta_ticks = 0;
+            } else {
+                vmsr_delta_ticks(thd_stat, i);
+            }
+        }
+
+        /*
+         * Identify the vCPU threads
+         * Calculate the Number of vCPU per package
+         */
+        CPU_FOREACH(cpu) {
+            for (int i = 0; i < num_threads; i++) {
+                if (cpu->thread_id == thd_stat[i].thread_id) {
+                    thd_stat[i].is_vcpu = true;
+                    thd_stat[i].vcpu_id = cpu->cpu_index;
+                    pkg_stat[thd_stat[i].numa_node_id].nb_vcpu++;
+                    break;
+                }
+            }
+        }
+
+        /* Calculate the total energy of all non-vCPU thread */
+        for (int i = 0; i < num_threads; i++) {
+            double temp;
+            if ((thd_stat[i].is_vcpu != true) &&
+                (thd_stat[i].delta_ticks > 0)) {
+                temp = vmsr_get_ratio(pkg_stat, thd_stat, maxticks, i);
+                pkg_stat[thd_stat[i].numa_node_id].e_ratio
+                    += (uint64_t)lround(temp);
+            }
+        }
+
+        /* Calculate the ratio per non-vCPU thread of each package */
+        for (int i = 0; i < maxpkgs; i++) {
+            if (pkg_stat[i].nb_vcpu > 0) {
+                pkg_stat[i].e_ratio = pkg_stat[i].e_ratio / pkg_stat[i].nb_vcpu;
+            }
+        }
+
+        /* Calculate the energy for each vCPU thread */
+        for (int i = 0; i < num_threads; i++) {
+            double temp;
+
+            if ((thd_stat[i].is_vcpu == true) &&
+                (thd_stat[i].delta_ticks > 0)) {
+                temp = vmsr_get_ratio(pkg_stat, thd_stat, maxticks, i);
+                vmsr->msr_value[thd_stat[i].vcpu_id] += (uint64_t)lround(temp);
+                vmsr->msr_value[thd_stat[i].vcpu_id] \
+                    += pkg_stat[thd_stat[i].numa_node_id].e_ratio;
+            }
+        }
+
+        /* Release the per-iteration tick buffers */
+        for (int i = 0; i < num_threads; i++) {
+            g_free(thd_stat[i].utime);
+            g_free(thd_stat[i].stime);
+        }
+        /* Reset the thread stats and drop this iteration's thread list */
+        memset(thd_stat, 0, num_threads * sizeof(thread_stat));
+        g_free(thread_ids);
+    }
+
+clean:
+    /* free all memory */
+    for (int i = 0; i < num_threads; i++) {
+        g_free(thd_stat[i].utime);
+        g_free(thd_stat[i].stime);
+    }
+    g_free(thd_stat);
+    g_free(thread_ids);
+
+    rcu_unregister_thread();
+    return NULL;
+}
+
+static int kvm_msr_energy_thread_init(KVMState *s, MachineState *ms)
+{
+    struct KVMMsrEnergy *r = &s->msr_energy;
+
+    /* Retrieve the number of vCPU */
+    r->cpus = ms->smp.cpus;
+
+    /* Allocate register memory (MSR_PKG_ENERGY_STATUS) for each vCPU */
+    r->msr_value = g_new0(uint64_t, r->cpus);
+
+    qemu_thread_create(&r->msr_thr, "kvm-msr",
+                       kvm_msr_energy_thread,
+                       s, QEMU_THREAD_JOINABLE);
+
+    return 0;
+}
+
 int kvm_arch_get_default_type(MachineState *ms)
 {
     return 0;
@@ -2774,6 +3042,46 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
                          strerror(-ret));
             exit(1);
         }
+
+        if (s->msr_energy.enable == true) {
+
+            r = kvm_filter_msr(s, MSR_RAPL_POWER_UNIT,
+                               kvm_rdmsr_rapl_power_unit, NULL);
+            if (!r) {
+                error_report("Could not install MSR_RAPL_POWER_UNIT "
+                             "handler: %s",
+                             strerror(-ret));
+                exit(1);
+            }
+
+            r = kvm_filter_msr(s, MSR_PKG_POWER_LIMIT,
+                               kvm_rdmsr_pkg_power_limit, NULL);
+            if (!r) {
+                error_report("Could not install MSR_PKG_POWER_LIMIT "
+                             "handler: %s",
+                             strerror(-ret));
+                exit(1);
+            }
+
+            r = kvm_filter_msr(s, MSR_PKG_POWER_INFO,
+                               kvm_rdmsr_pkg_power_info, NULL);
+            if (!r) {
+                error_report("Could not install MSR_PKG_POWER_INFO "
+                             "handler: %s",
+                             strerror(-ret));
+                exit(1);
+            }
+            r = kvm_filter_msr(s, MSR_PKG_ENERGY_STATUS,
+                               kvm_rdmsr_pkg_energy_status, NULL);
+            if (!r) {
+                error_report("Could not install MSR_PKG_ENERGY_STATUS "
+                             "handler: %s",
+                             strerror(-ret));
+                exit(1);
+            } else {
+                kvm_msr_energy_thread_init(s, ms);
+            }
+        }
     }
 
     return 0;
diff --git a/target/i386/kvm/meson.build b/target/i386/kvm/meson.build
index 84d9143e6029..16010638df69 100644
--- a/target/i386/kvm/meson.build
+++ b/target/i386/kvm/meson.build
@@ -3,6 +3,7 @@ i386_kvm_ss = ss.source_set()
 i386_kvm_ss.add(files(
   'kvm.c',
   'kvm-cpu.c',
+  'vmsr_energy.c',
 ))
 
 i386_kvm_ss.add(when: 'CONFIG_XEN_EMU', if_true: files('xen-emu.c'))
diff --git a/target/i386/kvm/vmsr_energy.c b/target/i386/kvm/vmsr_energy.c
new file mode 100644
index 000000000000..9f21ef2250a4
--- /dev/null
+++ b/target/i386/kvm/vmsr_energy.c
@@ -0,0 +1,278 @@
+/*
+ * QEMU KVM support -- x86 virtual energy-related MSR.
+ *
+ * Copyright 2023 Red Hat, Inc. 2023
+ *
+ *  Author:
+ *      Anthony Harivel <aharivel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "vmsr_energy.h"
+#include <stdint.h>
+#include "qapi/error.h"
+#include "io/channel.h"
+#include "io/channel-socket.h"
+
+#define MAX_PATH_LEN 50
+#define MAX_LINE_LEN 500
+
+static char *compute_default_paths(void)
+{
+    g_autofree char *state = qemu_get_local_state_dir();
+
+    return g_build_filename(state, "run", "qemu-vmsr-helper.sock", NULL);
+}
+
+static int vmsr_helper_socket_read(QIOChannel *ioc,
+                                  void *buf, int sz, Error **errp)
+{
+    ssize_t r = qio_channel_read_all(ioc, buf, sz, errp);
+
+    if (r < 0) {
+        object_unref(OBJECT(ioc));
+        ioc = NULL;
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+static int vmsr_helper_socket_write(QIOChannel *ioc,
+                                   int fd,
+                                   const void *buf, int sz, Error **errp)
+{
+    size_t nfds = (fd != -1);
+    while (sz > 0) {
+        struct iovec iov;
+        ssize_t n_written;
+
+        iov.iov_base = (void *)buf;
+        iov.iov_len = sz;
+        n_written = qio_channel_writev_full(QIO_CHANNEL(ioc), &iov, 1,
+                                            nfds ? &fd : NULL, nfds, 0, errp);
+
+        if (n_written <= 0) {
+            assert(n_written != QIO_CHANNEL_ERR_BLOCK);
+            object_unref(OBJECT(ioc));
+            ioc = NULL;
+            return n_written < 0 ? -EINVAL : 0;
+        }
+
+        nfds = 0;
+        buf += n_written;
+        sz -= n_written;
+    }
+
+    return 0;
+}
+
+uint64_t vmsr_read_msr(uint32_t reg, unsigned int cpu_id, uint32_t tid,
+                       const char *path)
+{
+    uint64_t data = 0;
+    char *socket_path = NULL;
+    unsigned int buffer[3];
+
+    if (path == NULL) {
+        socket_path = compute_default_paths();
+    } else {
+        socket_path = g_strdup(path);
+    }
+
+    SocketAddress saddr = {
+        .type = SOCKET_ADDRESS_TYPE_UNIX,
+        .u.q_unix.path = socket_path
+    };
+    QIOChannelSocket *sioc = qio_channel_socket_new();
+    Error *local_err = NULL;
+
+    int r;
+
+    qio_channel_set_name(QIO_CHANNEL(sioc), "vmsr-helper");
+    qio_channel_socket_connect_sync(sioc,
+                                    &saddr,
+                                    &local_err);
+    g_free(socket_path);
+    if (local_err) {
+        goto out_close;
+    }
+
+    /*
+     * Send the required arguments:
+     * 1. RAPL MSR register to read
+     * 2. On which CPU ID
+     * 3. From which vCPU (Thread ID)
+     */
+    buffer[0] = reg;
+    buffer[1] = cpu_id;
+    buffer[2] = tid;
+
+    r = vmsr_helper_socket_write(QIO_CHANNEL(sioc),
+                                 -1,
+                                 &buffer, sizeof(buffer),
+                                 &local_err);
+    if (r < 0) {
+        goto out_close;
+    }
+
+    r = vmsr_helper_socket_read(QIO_CHANNEL(sioc),
+                                &data, sizeof(data),
+                                &local_err);
+    if (r < 0) {
+        data = 0;
+        goto out_close;
+    }
+
+out_close:
+    /* Close socket. */
+    qio_channel_close(QIO_CHANNEL(sioc), NULL);
+    object_unref(OBJECT(sioc));
+    return data;
+}
+
+/* Retrieve the max number of physical CPU on the package */
+unsigned int vmsr_get_maxcpus(unsigned int package_num)
+{
+    int k, ncpus;
+    unsigned int maxcpus;
+    struct bitmask *cpus;
+
+    cpus = numa_allocate_cpumask();
+    ncpus = cpus->size;
+
+    if (numa_node_to_cpus(package_num, cpus) < 0) {
+        return 0;
+    }
+
+    maxcpus = 0;
+    for (k = 0; k < ncpus; k++) {
+        if (numa_bitmask_isbitset(cpus, k)) {
+            maxcpus++;
+        }
+    }
+
+    return maxcpus;
+}
+
+/* Retrieve the maximum number of physical package */
+unsigned int vmsr_get_max_physical_package(unsigned int max_cpus)
+{
+    unsigned int packageCount = 0;
+    int *uniquePackages;
+
+    char filePath[256];
+    FILE *file;
+
+    uniquePackages = g_malloc0(max_cpus * sizeof(int));
+
+    for (int i = 0; ; i++) {
+        snprintf(filePath, sizeof(filePath),
+            "/sys/devices/system/cpu/cpu%d/topology/physical_package_id", i);
+        file = fopen(filePath, "r");
+
+        if (file == NULL) {
+            break;
+        }
+
+        char packageId[10];
+        if (fgets(packageId, sizeof(packageId), file) == NULL) {
+            packageCount = 0;
+        }
+        fclose(file);
+
+        int currentPackageId = atoi(packageId);
+
+        bool isUnique = true;
+        for (int j = 0; j < packageCount; j++) {
+            if (uniquePackages[j] == currentPackageId) {
+                isUnique = false;
+                break;
+            }
+        }
+
+        if (isUnique) {
+            uniquePackages[packageCount] = currentPackageId;
+            packageCount++;
+        }
+    }
+
+    g_free(uniquePackages);
+    return packageCount;
+}
+
+int vmsr_read_thread_stat(struct thread_stat *thread, int pid, int index)
+{
+    char path[MAX_PATH_LEN];
+    snprintf(path, MAX_PATH_LEN, "/proc/%u/task/%d/stat", pid, \
+             thread->thread_id);
+
+    FILE *file = fopen(path, "r");
+    if (file == NULL) {
+        return -1;
+    }
+
+    int n = fscanf(file,
+        "%*d (%*[^)]) %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u"
+        " %llu %llu %*d %*d %*d %*d %*d %*d %*u %*u %*d %*u %*u"
+        " %*u %*u %*u %*u %*u %*u %*u %*u %*u %*d %*u %*u %u",
+        &thread->utime[index], &thread->stime[index], &thread->cpu_id);
+
+    fclose(file);
+    return n == 3 ? 0 : -1;
+}
+
+/* Read QEMU stat task folder to retrieve all QEMU threads ID */
+pid_t *vmsr_get_thread_ids(pid_t pid, int *num_threads)
+{
+    char path[100];
+    snprintf(path, sizeof(path), "/proc/%d/task", pid);
+
+    DIR *dir = opendir(path);
+    if (dir == NULL) {
+        perror("opendir");
+        return NULL;
+    }
+
+    pid_t *thread_ids = NULL;
+    int thread_count = 0;
+
+    struct dirent *ent;
+    while ((ent = readdir(dir)) != NULL) {
+        if (ent->d_name[0] == '.') {
+            continue;
+        }
+        pid_t tid = atoi(ent->d_name);
+        if (pid != tid) {
+            thread_ids = g_realloc(thread_ids,
+                                 (thread_count + 1) * sizeof(pid_t));
+            thread_ids[thread_count] = tid;
+            thread_count++;
+        }
+    }
+
+    closedir(dir);
+
+    *num_threads = thread_count;
+    return thread_ids;
+}
+
+void vmsr_delta_ticks(thread_stat *thd_stat, int i)
+{
+    thd_stat[i].delta_ticks = (thd_stat[i].utime[1] + thd_stat[i].stime[1])
+                            - (thd_stat[i].utime[0] + thd_stat[i].stime[0]);
+}
+
+double vmsr_get_ratio(package_energy_stat *pkg_stat,
+                        thread_stat *thd_stat,
+                        int maxticks, int i) {
+
+    return (pkg_stat[thd_stat[i].numa_node_id].e_delta / 100.0)
+            * ((100.0 / maxticks) * thd_stat[i].delta_ticks);
+}
+
+
diff --git a/target/i386/kvm/vmsr_energy.h b/target/i386/kvm/vmsr_energy.h
new file mode 100644
index 000000000000..bbce9e888f1f
--- /dev/null
+++ b/target/i386/kvm/vmsr_energy.h
@@ -0,0 +1,82 @@
+/*
+ * QEMU KVM support -- x86 virtual energy-related MSR.
+ *
+ * Copyright 2023 Red Hat, Inc. 2023
+ *
+ *  Author:
+ *      Anthony Harivel <aharivel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef VMSR_ENERGY_H
+#define VMSR_ENERGY_H
+
+#include "qemu/osdep.h"
+
+#include <numa.h>
+
+/*
+ * Define the interval time in micro seconds between 2 samples of
+ * energy related MSRs
+ */
+#define MSR_ENERGY_THREAD_SLEEP_US 1000000.0
+
+/*
+ * Thread statistic
+ * @ thread_id: TID (thread ID)
+ * @ is_vcpu: true if thread is a vCPU thread
+ * @ cpu_id: CPU number last executed on
+ * @ vcpu_id: vCPU ID
+ * @ numa_node_id: node number of the CPU
+ * @ utime: amount of clock ticks the thread
+ *          has been scheduled in User mode
+ * @ stime: amount of clock ticks the thread
+ *          has been scheduled in System mode
+ * @ delta_ticks: delta of utime+stime between
+ *          the two samples (before/after sleep)
+ */
+struct thread_stat {
+    unsigned int thread_id;
+    bool is_vcpu;
+    unsigned int cpu_id;
+    unsigned int vcpu_id;
+    unsigned int numa_node_id;
+    unsigned long long *utime;
+    unsigned long long *stime;
+    unsigned long long delta_ticks;
+};
+
+/*
+ * Package statistic
+ * @ e_start: package energy counter before the sleep
+ * @ e_end: package energy counter after the sleep
+ * @ e_delta: delta of package energy counter
+ * @ e_ratio: store the energy ratio of non-vCPU thread
+ * @ nb_vcpu: number of vCPU running on this package
+ */
+struct package_energy_stat {
+    uint64_t e_start;
+    uint64_t e_end;
+    uint64_t e_delta;
+    uint64_t e_ratio;
+    unsigned int nb_vcpu;
+};
+
+typedef struct thread_stat thread_stat;
+typedef struct package_energy_stat package_energy_stat;
+
+uint64_t vmsr_read_msr(uint32_t reg, unsigned int cpu_id,
+                       unsigned int tid, const char *path);
+void vmsr_delta_ticks(thread_stat *thd_stat, int i);
+unsigned int vmsr_get_maxcpus(unsigned int package_num);
+unsigned int vmsr_get_max_physical_package(unsigned int max_cpus);
+int vmsr_read_thread_stat(struct thread_stat *thread, int pid, int index);
+pid_t *vmsr_get_thread_ids(pid_t pid, int *num_threads);
+double vmsr_get_ratio(package_energy_stat *pkg_stat,
+                        thread_stat *thd_stat,
+                        int maxticks, int i);
+
+#endif /* VMSR_ENERGY_H */
-- 
2.41.0




* Re: [PATCH v2 1/3] qio: add support for SO_PEERCRED for socket channel
  2023-10-31 14:46 ` [PATCH v2 1/3] qio: add support for SO_PEERCRED for socket channel Anthony Harivel
@ 2023-11-01 10:20   ` Daniel P. Berrangé
  2023-11-01 14:23     ` Paolo Bonzini
  0 siblings, 1 reply; 11+ messages in thread
From: Daniel P. Berrangé @ 2023-11-01 10:20 UTC (permalink / raw)
  To: Anthony Harivel; +Cc: qemu-devel, pbonzini, mtosatti

On Tue, Oct 31, 2023 at 03:46:01PM +0100, Anthony Harivel wrote:
> The function qio_channel_get_peercred() returns a pointer to the
> credentials of the peer process connected to this socket.
> 
> This credentials structure is defined in <sys/socket.h> as follows:
> 
> struct ucred {
> 	pid_t pid;    /* Process ID of the sending process */
> 	uid_t uid;    /* User ID of the sending process */
> 	gid_t gid;    /* Group ID of the sending process */
> };
> 
> The use of this function is possible only for connected AF_UNIX stream
> sockets and for AF_UNIX stream and datagram socket pairs.
> 
> Signed-off-by: Anthony Harivel <aharivel@redhat.com>
> ---
>  include/io/channel.h | 20 ++++++++++++++++++++
>  io/channel-socket.c  | 17 +++++++++++++++++
>  io/channel.c         | 12 ++++++++++++
>  3 files changed, 49 insertions(+)
> 
> diff --git a/include/io/channel.h b/include/io/channel.h
> index 5f9dbaab65b0..99c02d61c3d9 100644
> --- a/include/io/channel.h
> +++ b/include/io/channel.h
> @@ -149,6 +149,9 @@ struct QIOChannelClass {
>                                    void *opaque);
>      int (*io_flush)(QIOChannel *ioc,
>                      Error **errp);
> +    void (*io_peercred)(QIOChannel *ioc,
> +                        struct ucred *cred,
> +                        Error **errp);

This isn't going to fly. 'struct ucred' is Linux specific, so this won't
compile on macOS, Windows, *BSD, and we don't really want a huge #ifdef
ladder in these APIs. This will need to explode the struct and return
the individual fields that are present instead, and the impl side must
compile on other OS, even if it's just stubbed out to return an error.
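
As a rough illustration of the "explode the struct" idea, here is a minimal
standalone sketch. It is not the QIOChannel API: the function name
peer_get_cred() and its raw-fd signature are hypothetical, chosen only to show
individual fields being returned and a non-Linux stub that still compiles.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc only exposes struct ucred with this defined */
#endif
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: returns the peer's PID/UID/GID through separate
 * out-parameters, so no caller ever sees the Linux-only struct ucred.
 * Returns 0 on success, -1 with errno set on failure.
 */
int peer_get_cred(int fd, pid_t *pid, uid_t *uid, gid_t *gid)
{
    assert(pid && uid && gid);
#if defined(__linux__)
    struct ucred cred;
    socklen_t len = sizeof(cred);

    if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0) {
        return -1; /* errno already set by getsockopt() */
    }
    *pid = cred.pid;
    *uid = cred.uid;
    *gid = cred.gid;
    return 0;
#else
    /* Stub for platforms without SO_PEERCRED: compiles, always fails. */
    (void)fd;
    errno = ENOTSUP;
    return -1;
#endif
}
```

In the real series the equivalent would presumably take a QIOChannel and an
Error **errp instead of a file descriptor and errno.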

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v2 2/3] tools: build qemu-vmsr-helper
  2023-10-31 14:46 ` [PATCH v2 2/3] tools: build qemu-vmsr-helper Anthony Harivel
@ 2023-11-01 10:36   ` Daniel P. Berrangé
  2023-11-01 10:52   ` Daniel P. Berrangé
  2023-11-01 14:32   ` Paolo Bonzini
  2 siblings, 0 replies; 11+ messages in thread
From: Daniel P. Berrangé @ 2023-11-01 10:36 UTC (permalink / raw)
  To: Anthony Harivel; +Cc: qemu-devel, pbonzini, mtosatti

On Tue, Oct 31, 2023 at 03:46:02PM +0100, Anthony Harivel wrote:
> Introduce a privileged helper to access RAPL MSR.
> 
> The privileged helper tool, qemu-vmsr-helper, is designed to provide
> virtual machines with the ability to read specific RAPL (Running Average
> Power Limit) MSRs without requiring CAP_SYS_RAWIO privileges or relying
> on external, out-of-tree patches.
> 
> The helper tool leverages Unix permissions and SO_PEERCRED socket
> options to enforce access control, ensuring that only processes
> explicitly requesting read access via readmsr() from a valid Thread ID
> can access these MSRs.
> 
> The list of RAPL MSRs that are allowed to be read by the helper tool is
> defined in rapl-msr-index.h. This list corresponds to the RAPL MSRs that
> will be supported in the next commit titled "Add support for RAPL MSRs
> in KVM/QEMU."
> 
> Signed-off-by: Anthony Harivel <aharivel@redhat.com>
> ---
>  docs/tools/index.rst            |   1 +
>  docs/tools/qemu-vmsr-helper.rst |  89 ++++++
>  meson.build                     |   5 +
>  tools/i386/qemu-vmsr-helper.c   | 507 ++++++++++++++++++++++++++++++++
>  tools/i386/rapl-msr-index.h     |  28 ++

Missing the .service and .socket unit files needed for systemd integration.
See those shipped for qemu-pr-helper for examples.

>  5 files changed, 630 insertions(+)
>  create mode 100644 docs/tools/qemu-vmsr-helper.rst
>  create mode 100644 tools/i386/qemu-vmsr-helper.c
>  create mode 100644 tools/i386/rapl-msr-index.h





> diff --git a/tools/i386/qemu-vmsr-helper.c b/tools/i386/qemu-vmsr-helper.c
> new file mode 100644
> index 000000000000..1d82a2753e44
> --- /dev/null
> +++ b/tools/i386/qemu-vmsr-helper.c

> +#define MAX_PATH_LEN 50
> +#define MAX_LINE_LEN 500
> +#define PR_OUT_FIXED_PARAM_SIZE 24

These latter two constants are not used.

> +
> +static char *socket_path;
> +static char *pidfile;
> +static enum { RUNNING, TERMINATE, TERMINATING } state;
> +static QIOChannelSocket *server_ioc;
> +static int server_watch;
> +static int num_active_sockets = 1;
> +static int noisy;
> +static int verbose;
> +
> +#ifdef CONFIG_LIBCAP_NG
> +static int uid = -1;
> +static int gid = -1;
> +#endif
> +
> +static void compute_default_paths(void)
> +{
> +    g_autofree char *state = qemu_get_local_state_dir();
> +
> +    socket_path = g_build_filename(state, "run", "qemu-vmsr-helper.sock", NULL);
> +    pidfile = g_build_filename(state, "run", "qemu-vmsr-helper.pid", NULL);
> +}
> +
> +/*
> + * Check if the TID that request the MSR read
> + * belongs to the peer. It should a TID of a vCPU.
> + */
> +static bool is_tid_present(pid_t pid, pid_t tid)
> +{
> +    char pidStr[20];
> +    char tidStr[20];
> +
> +    snprintf(pidStr, sizeof(pidStr), "%d", pid);
> +    snprintf(tidStr, sizeof(tidStr), "%d", tid);

Use g_strdup_printf() rather than an unchecked fixed-size buffer.

> +
> +    char pidPath[256];
> +    char tidPath[256];
> +
> +    snprintf(pidPath, sizeof(pidPath), "/proc/%s", pidStr);

You never read this variable once built

> +    snprintf(tidPath, sizeof(tidPath), "/proc/%s/task/%s", pidStr, tidStr);

g_build_filename

> +
> +    /* Check if the TID directory exists within the PID directory */
> +    if (access(tidPath, F_OK) == 0) {
> +        return true;
> +    }
> +
> +    return false;
> +}
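
For illustration only, a dependency-free sketch of the same check with a
single path and no unused variables. In QEMU itself g_strdup_printf() or
g_build_filename() with g_autofree would be the natural choice; plain
snprintf() with an explicit truncation check is used here just to keep the
example standalone. The helper name build_tid_path() is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Format "/proc/<pid>/task/<tid>"; returns false if buf is too small. */
bool build_tid_path(char *buf, size_t size, pid_t pid, pid_t tid)
{
    int n = snprintf(buf, size, "/proc/%d/task/%d", (int)pid, (int)tid);

    return n >= 0 && (size_t)n < size;
}

/* True if the task directory for tid exists under pid's /proc entry. */
bool is_tid_present(pid_t pid, pid_t tid)
{
    char path[64];

    if (!build_tid_path(path, sizeof(path), pid, tid)) {
        return false; /* treat a truncated path as "not present" */
    }
    return access(path, F_OK) == 0;
}
```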

> +static uint64_t vmsr_read_msr(uint32_t reg, unsigned int cpu_id)
> +{
> +    int fd;
> +    uint64_t data;
> +
> +    char path[MAX_PATH_LEN];
> +    snprintf(path, MAX_PATH_LEN, "/dev/cpu/%u/msr", cpu_id);

g_strdup_printf

> +
> +    fd = open(path , O_RDONLY);
> +    if (fd < 0) {
> +        return 0;
> +    }
> +    if (pread(fd, &data, sizeof data, reg) != sizeof data) {
> +        data = 0;
> +    }
> +
> +    close(fd);
> +    return data;
> +}
> +
> +static void usage(const char *name)
> +{
> +    (printf) (
> +"Usage: %s [OPTIONS] FILE\n"
> +"Virtual RAPL MSR helper program for QEMU\n"
> +"\n"
> +"  -h, --help                display this help and exit\n"
> +"  -V, --version             output version information and exit\n"
> +"\n"
> +"  -d, --daemon              run in the background\n"
> +"  -f, --pidfile=PATH        PID file when running as a daemon\n"
> +"                            (default '%s')\n"
> +"  -k, --socket=PATH         path to the unix socket\n"
> +"                            (default '%s')\n"
> +"  -T, --trace [[enable=]<pattern>][,events=<file>][,file=<file>]\n"
> +"                            specify tracing options\n"
> +#ifdef CONFIG_LIBCAP_NG
> +"  -u, --user=USER           user to drop privileges to\n"
> +"  -g, --group=GROUP         group to drop privileges to\n"
> +#endif
> +"\n"
> +QEMU_HELP_BOTTOM "\n"
> +    , name, pidfile, socket_path);
> +}
> +
> +static void version(const char *name)
> +{
> +    printf(
> +"%s " QEMU_FULL_VERSION "\n"
> +"Written by Anthony Harivel.\n"
> +"\n"
> +QEMU_COPYRIGHT "\n"
> +"This is free software; see the source for copying conditions.  There is NO\n"
> +"warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.\n"
> +    , name);
> +}
> +
> +typedef struct VMSRHelperClient {
> +    QIOChannelSocket *ioc;
> +    Coroutine *co;
> +    int fd;

This FD field is set to -1 and never changed / read thereafter AFAICS,
so please remove it.

> +} VMSRHelperClient;
> +
> +static void coroutine_fn vh_co_entry(void *opaque)
> +{
> +    VMSRHelperClient *client = opaque;
> +    uint64_t vmsr;
> +    uint32_t request[3];
> +    struct ucred cred;
> +    int r;
> +    Error *local_err = NULL;
> +
> +    qio_channel_set_blocking(QIO_CHANNEL(client->ioc),
> +                             false, NULL);
> +
> +    qio_channel_set_follow_coroutine_ctx(QIO_CHANNEL(client->ioc), true);
> +
> +    /*
> +     * Check peer credentials
> +     * Only QEMU PID/TID are allowed
> +     */
> +    qio_channel_get_peercred(QIO_CHANNEL(client->ioc), &cred, &local_err);
> +
> +    if (cred.pid == 0) {
> +        if (local_err != NULL) {
> +            error_report_err(local_err);
> +        }
> +        error_report("Failed to get peer credentials");
> +        goto out;
> +    }
> +
> +    /*
> +     * Read the requested MSR
> +     * Only RAPL MSR in rapl-msr-index.h is allowed
> +     */
> +    r = qio_channel_read_all(QIO_CHANNEL(client->ioc),
> +                             (char *) &request, sizeof(request), NULL);

The 'NULL' should be &local_err, so that the next error_report
line actually gives a useful error message with actual details.

> +    if (!is_msr_allowed(request[0]) || r < 0) {

request[0] is accessing uninitialized memory in case that r < 0.
You must check for errors first.

> +        error_report("Read request fail: %d, %d", request[0], request[1]);

use error_report_err in the 'r < 0' case.
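
To make the ordering concrete, here is a standalone sketch over a plain file
descriptor. It does not use the QIOChannel API; read_full(), read_request()
and the REQ_* codes are hypothetical names, and the MSR addresses are the
values I'd expect from the Intel SDM rather than something taken from
rapl-msr-index.h. The point is only that the I/O result is checked before any
byte of the request is inspected.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

enum { REQ_OK = 0, REQ_IO_ERROR = -1, REQ_DENIED = -2 };

/* Allow-list; addresses assumed from the Intel SDM. */
static bool msr_allowed(uint32_t reg)
{
    switch (reg) {
    case 0x606: /* MSR_RAPL_POWER_UNIT */
    case 0x610: /* MSR_PKG_POWER_LIMIT */
    case 0x611: /* MSR_PKG_ENERGY_STATUS */
    case 0x614: /* MSR_PKG_POWER_INFO */
        return true;
    default:
        return false;
    }
}

/* Read exactly len bytes; -1 on error or premature EOF. */
static int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;

    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -1;
        }
        if (n == 0) {
            return -1; /* peer closed before a full request arrived */
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* request[] = { msr, cpu_id, tid }, as in the patch. */
int read_request(int fd, uint32_t request[3])
{
    /* 1. I/O errors first: request[] is not touched if the read failed. */
    if (read_full(fd, request, 3 * sizeof(request[0])) < 0) {
        return REQ_IO_ERROR;
    }
    /* 2. Only now is it safe to look at request[0]. */
    if (!msr_allowed(request[0])) {
        return REQ_DENIED;
    }
    return REQ_OK;
}
```

With the two failure modes separated, the I/O case can report the Error
filled in by the channel, and the denied case can report the offending MSR.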

> +        goto out;
> +    }
> +
> +    vmsr = vmsr_read_msr(request[0], request[1]);
> +
> +    if (!is_tid_present(cred.pid, request[2])) {
> +        error_report("requested TID not in peer PID");
> +        vmsr = 0;
> +    }
> +
> +    r = qio_channel_write_all(QIO_CHANNEL(client->ioc),
> +                         (char *) &vmsr, sizeof(vmsr), NULL);

Again you must pass &local_err...

> +    if (r < 0) {
> +        error_report("write vmsr failed");

and use error_report_err

> +        goto out;
> +    }
> +
> +out:
> +    object_unref(OBJECT(client->ioc));
> +    g_free(client);
> +}
> +
> +static gboolean accept_client(QIOChannel *ioc,
> +                              GIOCondition cond,
> +                              gpointer opaque)
> +{
> +    QIOChannelSocket *cioc;
> +    VMSRHelperClient *vmsrh;
> +
> +    cioc = qio_channel_socket_accept(QIO_CHANNEL_SOCKET(ioc),
> +                                     NULL);
> +    if (!cioc) {
> +        return TRUE;
> +    }
> +
> +    vmsrh = g_new(VMSRHelperClient, 1);
> +    vmsrh->ioc = cioc;
> +    vmsrh->fd = -1;
> +    vmsrh->co = qemu_coroutine_create(vh_co_entry, vmsrh);
> +    qemu_coroutine_enter(vmsrh->co);
> +
> +    return TRUE;
> +}
> +
> +static void termsig_handler(int signum)
> +{
> +    qatomic_cmpxchg(&state, RUNNING, TERMINATE);
> +    qemu_notify_event();
> +}
> +
> +static void close_server_socket(void)
> +{
> +    assert(server_ioc);
> +
> +    g_source_remove(server_watch);
> +    server_watch = -1;
> +    object_unref(OBJECT(server_ioc));
> +    num_active_sockets--;
> +}
> +
> +#ifdef CONFIG_LIBCAP_NG
> +static int drop_privileges(void)
> +{
> +    /* clear all capabilities */
> +    capng_clear(CAPNG_SELECT_BOTH);
> +
> +    if (capng_update(CAPNG_ADD, CAPNG_EFFECTIVE | CAPNG_PERMITTED,
> +                     CAP_SYS_RAWIO) < 0) {
> +        return -1;
> +    }
> +
> +    /*
> +     * Change user/group id, retaining the capabilities.
> +     * Because file descriptors are passed via SCM_RIGHTS,
> +     * we don't need supplementary groups (and in fact the helper
> +     * can run as "nobody").
> +     */
> +    if (capng_change_id(uid != -1 ? uid : getuid(),
> +                        gid != -1 ? gid : getgid(),
> +                        CAPNG_DROP_SUPP_GRP | CAPNG_CLEAR_BOUNDING)) {
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +#endif
> +
> +int main(int argc, char **argv)
> +{
> +    const char *sopt = "hVk:f:dT:u:g:vq";
> +    struct option lopt[] = {
> +        { "help", no_argument, NULL, 'h' },
> +        { "version", no_argument, NULL, 'V' },
> +        { "socket", required_argument, NULL, 'k' },
> +        { "pidfile", required_argument, NULL, 'f' },
> +        { "daemon", no_argument, NULL, 'd' },
> +        { "trace", required_argument, NULL, 'T' },
> +        { "user", required_argument, NULL, 'u' },
> +        { "group", required_argument, NULL, 'g' },
> +        { "verbose", no_argument, NULL, 'v' },
> +        { "quiet", no_argument, NULL, 'q' },
> +        { NULL, 0, NULL, 0 }
> +    };
> +    int opt_ind = 0;
> +    int loglevel = 1;
> +    int quiet = 0;
> +    int ch;
> +    Error *local_err = NULL;
> +    bool daemonize = false;
> +    bool pidfile_specified = false;
> +    bool socket_path_specified = false;
> +    unsigned socket_activation;
> +
> +    struct sigaction sa_sigterm;
> +    memset(&sa_sigterm, 0, sizeof(sa_sigterm));
> +    sa_sigterm.sa_handler = termsig_handler;
> +    sigaction(SIGTERM, &sa_sigterm, NULL);
> +    sigaction(SIGINT, &sa_sigterm, NULL);
> +    sigaction(SIGHUP, &sa_sigterm, NULL);
> +
> +    signal(SIGPIPE, SIG_IGN);
> +
> +    error_init(argv[0]);
> +    module_call_init(MODULE_INIT_TRACE);
> +    module_call_init(MODULE_INIT_QOM);
> +    qemu_add_opts(&qemu_trace_opts);
> +    qemu_init_exec_dir(argv[0]);
> +
> +    compute_default_paths();
> +
> +    while ((ch = getopt_long(argc, argv, sopt, lopt, &opt_ind)) != -1) {
> +        switch (ch) {
> +        case 'k':
> +            g_free(socket_path);
> +            socket_path = g_strdup(optarg);
> +            socket_path_specified = true;
> +            if (socket_path[0] != '/') {
> +                error_report("socket path must be absolute");
> +                exit(EXIT_FAILURE);
> +            }
> +            break;
> +        case 'f':
> +            g_free(pidfile);
> +            pidfile = g_strdup(optarg);
> +            pidfile_specified = true;
> +            break;
> +#ifdef CONFIG_LIBCAP_NG
> +        case 'u': {
> +            unsigned long res;
> +            struct passwd *userinfo = getpwnam(optarg);
> +            if (userinfo) {
> +                uid = userinfo->pw_uid;
> +            } else if (qemu_strtoul(optarg, NULL, 10, &res) == 0 &&
> +                       (uid_t)res == res) {
> +                uid = res;
> +            } else {
> +                error_report("invalid user '%s'", optarg);
> +                exit(EXIT_FAILURE);
> +            }
> +            break;
> +        }
> +        case 'g': {
> +            unsigned long res;
> +            struct group *groupinfo = getgrnam(optarg);
> +            if (groupinfo) {
> +                gid = groupinfo->gr_gid;
> +            } else if (qemu_strtoul(optarg, NULL, 10, &res) == 0 &&
> +                       (gid_t)res == res) {
> +                gid = res;
> +            } else {
> +                error_report("invalid group '%s'", optarg);
> +                exit(EXIT_FAILURE);
> +            }
> +            break;
> +        }
> +#else
> +        case 'u':
> +        case 'g':
> +            error_report("-%c not supported by this %s", ch, argv[0]);
> +            exit(1);
> +#endif
> +        case 'd':
> +            daemonize = true;
> +            break;
> +        case 'q':
> +            quiet = 1;
> +            break;
> +        case 'v':
> +            ++loglevel;
> +            break;
> +        case 'T':
> +            trace_opt_parse(optarg);
> +            break;
> +        case 'V':
> +            version(argv[0]);
> +            exit(EXIT_SUCCESS);
> +            break;
> +        case 'h':
> +            usage(argv[0]);
> +            exit(EXIT_SUCCESS);
> +            break;
> +        case '?':
> +            error_report("Try `%s --help' for more information.", argv[0]);
> +            exit(EXIT_FAILURE);
> +        }
> +    }
> +
> +    /* set verbosity */
> +    noisy = !quiet && (loglevel >= 3);
> +    verbose = quiet ? 0 : MIN(loglevel, 3);

AFAIK nothing ever uses the noisy or verbose flags, so please remove
them or use them.

> +
> +    if (!trace_init_backends()) {
> +        exit(EXIT_FAILURE);
> +    }
> +    trace_init_file();
> +    qemu_set_log(LOG_TRACE, &error_fatal);


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v2 2/3] tools: build qemu-vmsr-helper
  2023-10-31 14:46 ` [PATCH v2 2/3] tools: build qemu-vmsr-helper Anthony Harivel
  2023-11-01 10:36   ` Daniel P. Berrangé
@ 2023-11-01 10:52   ` Daniel P. Berrangé
  2023-11-01 14:32   ` Paolo Bonzini
  2 siblings, 0 replies; 11+ messages in thread
From: Daniel P. Berrangé @ 2023-11-01 10:52 UTC (permalink / raw)
  To: Anthony Harivel; +Cc: qemu-devel, pbonzini, mtosatti

On Tue, Oct 31, 2023 at 03:46:02PM +0100, Anthony Harivel wrote:
> Introduce a privileged helper to access RAPL MSR.
> 
> The privileged helper tool, qemu-vmsr-helper, is designed to provide
> virtual machines with the ability to read specific RAPL (Running Average
> Power Limit) MSRs without requiring CAP_SYS_RAWIO privileges or relying
> on external, out-of-tree patches.
> 
> The helper tool leverages Unix permissions and SO_PEERCRED socket
> options to enforce access control, ensuring that only processes
> explicitly requesting read access via readmsr() from a valid Thread ID
> can access these MSRs.
> 
> The list of RAPL MSRs that are allowed to be read by the helper tool is
> defined in rapl-msr-index.h. This list corresponds to the RAPL MSRs that
> will be supported in the next commit titled "Add support for RAPL MSRs
> in KVM/QEMU."
> 
> Signed-off-by: Anthony Harivel <aharivel@redhat.com>
> ---
>  docs/tools/index.rst            |   1 +
>  docs/tools/qemu-vmsr-helper.rst |  89 ++++++
>  meson.build                     |   5 +
>  tools/i386/qemu-vmsr-helper.c   | 507 ++++++++++++++++++++++++++++++++
>  tools/i386/rapl-msr-index.h     |  28 ++
>  5 files changed, 630 insertions(+)
>  create mode 100644 docs/tools/qemu-vmsr-helper.rst
>  create mode 100644 tools/i386/qemu-vmsr-helper.c
>  create mode 100644 tools/i386/rapl-msr-index.h
> 


> +/*
> + * Check if the TID that request the MSR read
> + * belongs to the peer. It should a TID of a vCPU.
> + */
> +static bool is_tid_present(pid_t pid, pid_t tid)
> +{
> +    char pidStr[20];
> +    char tidStr[20];
> +
> +    snprintf(pidStr, sizeof(pidStr), "%d", pid);
> +    snprintf(tidStr, sizeof(tidStr), "%d", tid);
> +
> +    char pidPath[256];
> +    char tidPath[256];
> +
> +    snprintf(pidPath, sizeof(pidPath), "/proc/%s", pidStr);
> +    snprintf(tidPath, sizeof(tidPath), "/proc/%s/task/%s", pidStr, tidStr);
> +
> +    /* Check if the TID directory exists within the PID directory */
> +    if (access(tidPath, F_OK) == 0) {
> +        return true;
> +    }
> +
> +    return false;
> +}
> +
> +/*
> + * Only the RAPL MSR in target/i386/cpu.h are allowed
> + */
> +static bool is_msr_allowed(uint32_t reg)
> +{
> +    switch (reg) {
> +    case MSR_RAPL_POWER_UNIT:
> +    case MSR_PKG_POWER_LIMIT:
> +    case MSR_PKG_ENERGY_STATUS:
> +    case MSR_PKG_POWER_INFO:
> +        return true;
> +    default:
> +        return false;
> +    }
> +}

> +    /*
> +     * Check peer credentials
> +     * Only QEMU PID/TID are allowed

This says only QEMU is allowed

> +     */
> +    qio_channel_get_peercred(QIO_CHANNEL(client->ioc), &cred, &local_err);
> +
> +    if (cred.pid == 0) {
> +        if (local_err != NULL) {
> +            error_report_err(local_err);
> +        }
> +        error_report("Failed to get peer credentials");
> +        goto out;
> +    }
> +
> +    /*
> +     * Read the requested MSR
> +     * Only RAPL MSR in rapl-msr-index.h is allowed
> +     */
> +    r = qio_channel_read_all(QIO_CHANNEL(client->ioc),
> +                             (char *) &request, sizeof(request), NULL);
> +    if (!is_msr_allowed(request[0]) || r < 0) {
> +        error_report("Read request fail: %d, %d", request[0], request[1]);
> +        goto out;
> +    }
> +
> +    vmsr = vmsr_read_msr(request[0], request[1]);
> +
> +    if (!is_tid_present(cred.pid, request[2])) {
> +        error_report("requested TID not in peer PID");
> +        vmsr = 0;
> +    }

This check is merely validating that the thread ID in the message
is a child of the process ID connected to the socket. Any process
on the entire host can satisfy this requirement.

I don't see what is limiting this to only QEMU as claimed earlier,
unless you're expecting the UNIX socket permissions to be such
that only processes under the qemu:qemu user:group pair can
access the socket? That would be a libvirt-based permissions
assumption though.



> +
> +    r = qio_channel_write_all(QIO_CHANNEL(client->ioc),
> +                         (char *) &vmsr, sizeof(vmsr), NULL);
> +    if (r < 0) {
> +        error_report("write vmsr failed");
> +        goto out;
> +    }
> +
> +out:
> +    object_unref(OBJECT(client->ioc));
> +    g_free(client);
> +}

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v2 3/3] Add support for RAPL MSRs in KVM/Qemu
  2023-10-31 14:46 ` [PATCH v2 3/3] Add support for RAPL MSRs in KVM/Qemu Anthony Harivel
@ 2023-11-01 11:24   ` Daniel P. Berrangé
  2023-11-01 14:33   ` Paolo Bonzini
  1 sibling, 0 replies; 11+ messages in thread
From: Daniel P. Berrangé @ 2023-11-01 11:24 UTC (permalink / raw)
  To: Anthony Harivel; +Cc: qemu-devel, pbonzini, mtosatti

On Tue, Oct 31, 2023 at 03:46:03PM +0100, Anthony Harivel wrote:
> Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL
> interface (Running Average Power Limit) for advertising the accumulated
> energy consumption of various power domains (e.g. CPU packages, DRAM,
> etc.).
> 
> The consumption is reported via MSRs (model specific registers) like
> MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are
> 64 bits registers that represent the accumulated energy consumption in
> micro Joules. They are updated by microcode every ~1ms.
> 
> For now, KVM always returns 0 when the guest requests the value of
> these MSRs. Use the KVM MSR filtering mechanism to allow QEMU handle
> these MSRs dynamically in userspace.
> 
> To limit the amount of system calls for every MSR call, create a new
> thread in QEMU that updates the "virtual" MSR values asynchronously.
> 
> Each vCPU has its own vMSR to reflect the independence of vCPUs. The
> thread updates the vMSR values with the ratio of energy consumed of
> the whole physical CPU package the vCPU thread runs on and the
> thread's utime and stime values.
> 
> All other non-vCPU threads are also taken into account. Their energy
> consumption is evenly distributed among all vCPUs threads running on
> the same physical CPU package.
> 
> To overcome the problem that reading the RAPL MSR requires priviliged
> access, a socket communication between QEMU and the qemu-vmsr-helper is
> mandatory. You can specified the socket path in the parameter.
> 
> This feature is activated with -accel kvm,rapl=true,path=/path/sock.sock
> 
> Actual limitation:
> - Works only on Intel host CPU because AMD CPUs are using different MSR
>   adresses.
> 
> - Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at
>   the moment.
> 
> - Since each vCPU has an independent vMSR value, the vCPU topology must
>   be changed to match that reality. There must be a single vCPU per
>   virtual socket (e.g.: -smp 4,sockets=4). Accessing pkg-0 energy will
>   give vCPU 0 energy, pkg-1 will give vCPU 1 energy, etc.
> 
> Signed-off-by: Anthony Harivel <aharivel@redhat.com>
> 
>  accel/kvm/kvm-all.c           |  26 +++
>  docs/specs/index.rst          |   1 +
>  docs/specs/rapl-msr.rst       | 131 +++++++++++++++
>  include/sysemu/kvm_int.h      |  12 ++
>  target/i386/cpu.h             |   8 +
>  target/i386/kvm/kvm.c         | 308 ++++++++++++++++++++++++++++++++++
>  target/i386/kvm/meson.build   |   1 +
>  target/i386/kvm/vmsr_energy.c | 278 ++++++++++++++++++++++++++++++
>  target/i386/kvm/vmsr_energy.h |  82 +++++++++
>  9 files changed, 847 insertions(+)
>  create mode 100644 docs/specs/rapl-msr.rst
>  create mode 100644 target/i386/kvm/vmsr_energy.c
>  create mode 100644 target/i386/kvm/vmsr_energy.h
> 


> @@ -3840,6 +3856,16 @@ static void kvm_accel_class_init(ObjectClass *oc, void *data)
>      object_class_property_set_description(oc, "dirty-ring-size",
>          "Size of KVM dirty page ring buffer (default: 0, i.e. use bitmap)");
>  
> +    object_class_property_add_bool(oc, "rapl",
> +                                   NULL,
> +                                   kvm_set_kvm_rapl);
> +    object_class_property_set_description(oc, "rapl",
> +        "Allow energy related MSRs for RAPL interface in Guest");
> +
> +    object_class_property_add_str(oc, "path", NULL,
> +                                  kvm_set_kvm_rapl_socket_path);

"path" is an awfully generic property name to be setting on the
KVM accelerator. This needs to be something like "rapl-helper-socket"

> +    object_class_property_set_description(oc, "path",
> +        "Socket Path for comminucating with the Virtual MSR helper daemon");
>      kvm_arch_accel_class_init(oc);
>  }
>  

> diff --git a/docs/specs/rapl-msr.rst b/docs/specs/rapl-msr.rst
> new file mode 100644
> index 000000000000..ec62a8206337
> --- /dev/null
> +++ b/docs/specs/rapl-msr.rst
> @@ -0,0 +1,131 @@
> +================
> +RAPL MSR support
> +================
> +
> +The RAPL interface (Running Average Power Limit) is advertising the accumulated
> +energy consumption of various power domains (e.g. CPU packages, DRAM, etc.).
> +
> +The consumption is reported via MSRs (model specific registers) like
> +MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are 64 bits
> +registers that represent the accumulated energy consumption in micro Joules.

> +The MSR_PKG_ENERGY_STATUS is a counter; it represents the total amount of
> +energy consumed since the last time the register was cleared. If you multiply
> +it with the UNIT provided above you'll get the power in micro-joules. This
> +counter is always increasing and it increases more or less faster depending on
> +the consumption of the package. This counter is supposed to overflow at some
> +point.

What happens when we live migrate the guest?

This is just as likely to result in values decreasing as increasing, on
the new host. You could call it "overflow" if the new host has a value
lower than the old host, but is there risk that whatever consumes this
data gets confused, if it overflows with an insanely large jump ?

Are there any other consequences of live migration that could cause
problems, requiring a blocker ?
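
To illustrate the concern, here is a sketch of how a guest-side consumer
might compute the energy delta between two samples. It assumes the consumer
treats the energy status value as a wrapping 32-bit counter (my understanding
of the Intel SDM layout); energy_delta() is a hypothetical helper, not code
from this series.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: only the low 32 bits of the MSR carry the energy counter. */
#define ENERGY_COUNTER_MASK 0xFFFFFFFFULL

/*
 * Delta between two samples, modulo 2^32. A real wrap-around yields the
 * correct small delta; a counter that went *backwards* (e.g. after landing
 * on a host with a lower value) is indistinguishable from a wrap and shows
 * up as a delta close to 2^32.
 */
uint64_t energy_delta(uint64_t prev, uint64_t cur)
{
    return (cur - prev) & ENERGY_COUNTER_MASK;
}
```

For example, going from 1000 to 900 is reported as 4294967196 units consumed
rather than as a decrease of 100, which is the "insanely large jump" case.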


> +High level implementation
> +-------------------------
> +
> +In order to update the value of the virtual MSR, a QEMU thread is created.
> +The thread is basically just an infinity loop that does:
> +
> +1. Snapshot of the time metrics of all QEMU threads (Time spent scheduled in
> +   Userspace and System)
> +
> +2. Snapshot of the actual MSR_PKG_ENERGY_STATUS counter of all packages where
> +   the QEMU threads are running on.
> +
> +3. Sleep for 1 second - During this pause the vcpu and other non-vcpu threads
> +   will do what they have to do and so the energy counter will increase.
> +
> +4. Repeat 2. and 3. and calculate the delta of every metrics representing the
> +   time spent scheduled for each QEMU thread *and* the energy spent by the
> +   packages during the pause.
> +
> +5. Filter the vcpu threads and the non-vcpu threads.
> +
> +6. The total energy spent by the non-vcpu threads is divided by the number
> +   of vcpu threads so that each vcpu thread will get an equal part of the
> +   energy spent by the QEMU workers.
> +
> +7. Calculate the ratio of energy spent per vcpu threads.
> +
> +8. The virtual MSRs are updated for each vcpu by adding for each vcpu the
> +   energy spent + the quote part of the non-vcpu.

This appears to be assuming that each vCPU is 1:1 pinned to a
specific host pCPU. If the vCPU is floating across different
pCPUs, then surely the energy readings will be going forwards
and backwards every time the host moves a vCPU ?

> +
> +9. loop back to 1.
> +
> +Ratio calculation
> +-----------------
> +
> +In Linux, a process has an execution time associated with it. The scheduler is
> +dividing the time in clock ticks. The number of clock ticks per second can be
> +found by the sysconf system call. A typical value of clock ticks per second is
> +100. So a core can run a process at the maximum of 100 ticks per second. If a
> +package has 4 cores, 400 ticks maximum can be scheduled on all the cores
> +of the package for a period of 1 second.
> +
> +The /proc/[pid]/stat [#b]_ is a sysfs file that can give the executed time of a
> +process with the [pid] as the process ID. It gives the amount of ticks the
> +process has been scheduled in userspace (utime) and kernel space (stime).
> +
> +By reading those metrics for a thread, one can calculate the ratio of time the
> +package has spent executing the thread.
> +
> +Example:
> +
> +A 4 cores package can schedule a maximum of 400 ticks per second with 100 ticks
> +per second per core. If a thread was scheduled for 100 ticks between a second
> +on this package, that means my thread has been scheduled for 1/4 of the whole
> +package. With that, the calculation of the energy spent by the thread on this
> +package during this whole second is 1/4 of the total energy spent by the
> +package.

/proc/$PID/stat reports the execution time across all host CPUs
for the process. To cope with vCPUs floating across host pCPUs,
we surely need to sum the energy usage across all packages,
and scale that by the execution time. We can't scale energy of
the individual packages in isolation if we don't have execution
time specific to each package.
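
A small sketch of both calculations, under my own assumptions about units:
the first function is the per-package ratio from the doc example above, the
second is one possible reading of "sum across all packages, then scale".
Function names and parameters are hypothetical and not taken from
vmsr_energy.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Share of an energy delta attributed to one thread:
 *   e_delta * delta_ticks / max_ticks
 * where max_ticks is the number of ticks that could have been scheduled
 * during the sampling interval (e.g. 4 cores * 100 ticks/s * 1 s = 400).
 */
uint64_t thread_energy_share(uint64_t e_delta, uint64_t delta_ticks,
                             uint64_t max_ticks)
{
    if (max_ticks == 0) {
        return 0;
    }
    if (delta_ticks > max_ticks) {
        delta_ticks = max_ticks; /* clamp against sampling jitter */
    }
    return e_delta * delta_ticks / max_ticks;
}

/*
 * Variant for floating vCPUs: sum the energy of all packages and scale by
 * the thread's ticks relative to the ticks available on the whole host,
 * since /proc only gives execution time summed over all host CPUs.
 */
uint64_t thread_energy_all_packages(const uint64_t *pkg_e_delta,
                                    size_t nb_pkgs, uint64_t delta_ticks,
                                    uint64_t ticks_per_pkg)
{
    uint64_t total = 0;

    for (size_t i = 0; i < nb_pkgs; i++) {
        total += pkg_e_delta[i];
    }
    return thread_energy_share(total, delta_ticks, ticks_per_pkg * nb_pkgs);
}
```

With the doc's numbers (100 ticks out of 400, package delta of 1000 units)
the first function returns 250, i.e. 1/4 of the package energy.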

> +Current Limitations
> +-------------------
> +
> +- Works only on Intel host CPUs because AMD CPUs are using different MSR
> +  addresses.

What about the relationship with the guest CPU model?

If we're exposing MSRs to the guest that are only defined for Intel CPUs,
then we should only expose them if the guest CPU model vendor == Intel
too.

> +- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at the
> +  moment.
> +
> +- Since each vCPU has an independent vMSR value, the vCPU topology must be
> +  changed to match that reality. There must be a single vCPU per virtual socket
> +  (e.g.: -smp 4,sockets=4). Accessing pkg-0 energy will give vCPU 0 energy,
> +  pkg-1 will give vCPU 1 energy, etc.

This is quite unfortunate, as in general it is a bad idea to expose
large socket counts to guests. Some OSes have punitive licensing rules
for large socket counts. Generally real world hardware has small socket
counts, with large core counts, which is why QEMU changed to expose
cores by default and we recommend mgmt apps to do this too.

The most obvious & accurate scenario to use this functionality is
when doing host CPU passthrough, with strict CPU pinning, and
matching host/guest topology. So I think removing this limitation
is critical to making this feature useful.


> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index e7c054cc160b..6dd8576b6ef6 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c

> @@ -2558,6 +2607,225 @@ static void register_smram_listener(Notifier *n, void *unused)
>                                   &smram_address_space, 1, "kvm-smram");
>  }
>  
> +static void *kvm_msr_energy_thread(void *data)
> +{
> +    KVMState *s = data;
> +    struct KVMMsrEnergy *vmsr = &s->msr_energy;
> +    unsigned int maxpkgs, maxcpus, maxticks;
> +    package_energy_stat *pkg_stat;
> +    int num_threads, tmp_num_threads = 0;
> +    thread_stat *thd_stat;

  g_autofree

> +    CPUState *cpu;
> +    pid_t pid, *thread_ids;

 g_autofree  pid_t *thread_ids;

> +
> +    rcu_register_thread();
> +
> +    /* Get QEMU PID*/
> +    pid = getpid();
> +
> +    /* Assuming those values are the same accross physical system/packages */
> +    /* Nb of CPUS per packages */
> +    maxcpus = vmsr_get_maxcpus(0);
> +    /* Nb of Physical Packages on the system */
> +    maxpkgs = vmsr_get_max_physical_package(maxcpus);
> +
> +    if (maxpkgs == 0) {
> +        return NULL;
> +    }
> +
> +    /* Those MSR values should not change as well */
> +    vmsr->msr_unit  = vmsr_read_msr(MSR_RAPL_POWER_UNIT, 0, pid,
> +                                    s->msr_energy.socket_path);
> +    vmsr->msr_limit = vmsr_read_msr(MSR_PKG_POWER_LIMIT, 0, pid,
> +                                    s->msr_energy.socket_path);
> +    vmsr->msr_info  = vmsr_read_msr(MSR_PKG_POWER_INFO, 0, pid,
> +                                    s->msr_energy.socket_path);
> +
> +    /* Allocate memory for each package energy status */
> +    pkg_stat = (package_energy_stat *)
> +        g_malloc0(maxpkgs * sizeof(package_energy_stat));

Use g_new0(package_energy_stat, maxpkgs)

> +
> +    /* Pre-allocate memory for thread stats */
> +    thd_stat = g_new0(thread_stat, 1);
> +
> +    /*
> +     * Max numbers of ticks per package
> +     * time in second * number of ticks/second * Number of cores / package
> +     * ex: for 100 ticks/second/CPU, 12 CPUs per Package gives 1200 ticks max
> +     */
> +    maxticks = (MSR_ENERGY_THREAD_SLEEP_US / 1000000)
> +                    * sysconf(_SC_CLK_TCK) * maxcpus;
> +
> +    while (true) {
> +
> +        /* Get all qemu threads id */
> +        thread_ids = vmsr_get_thread_ids(pid, &num_threads);
> +
> +        if (thread_ids == NULL) {
> +            goto clean;
> +        }
> +
> +        if (tmp_num_threads < num_threads) {
> +
> +            void *tmp_ptr;
> +
> +            tmp_ptr = g_realloc(thd_stat, num_threads * sizeof(thread_stat));

g_renew(thread_stat, thd_stat, num_threads)

> +            thd_stat = (thread_stat *) tmp_ptr;
> +        }
> +
> +        tmp_num_threads = num_threads;
> +
> +        /* Populate all the thread stats */
> +        for (int i = 0; i < num_threads; i++) {
> +            thd_stat[i].utime = calloc(2, sizeof(unsigned long long));
> +            thd_stat[i].stime = calloc(2, sizeof(unsigned long long));

g_new0

> +            thd_stat[i].thread_id = thread_ids[i];
> +            vmsr_read_thread_stat(&thd_stat[i], pid, 0);
> +            thd_stat[i].numa_node_id = numa_node_of_cpu(thd_stat[i].cpu_id);
> +        }
> +
> +        /* Retrieve all packages power plane energy counter */
> +        for (int i = 0; i <= maxpkgs; i++) {
> +            for (int j = 0; j < num_threads; j++) {
> +                /*
> +                 * Use the first thread we found that ran on the CPU
> +                 * of the package to read the packages energy counter
> +                 */
> +                if (thd_stat[j].numa_node_id == i) {
> +                    pkg_stat[i].e_start =
> +                    vmsr_read_msr(MSR_PKG_ENERGY_STATUS, i, pid,
> +                                  s->msr_energy.socket_path);
> +                    break;
> +                }
> +            }
> +        }
> +
> +        /* Sleep a short period while the other threads are working */
> +        usleep(MSR_ENERGY_THREAD_SLEEP_US);
> +
> +        /*
> +         * Retrieve all packages power plane energy counter
> +         * Calculate the delta of all packages
> +         */
> +        for (int i = 0; i <= maxpkgs; i++) {
> +            for (int j = 0; j < num_threads; j++) {
> +                /*
> +                 * Use the first thread we found that ran on the CPU
> +                 * of the package to read the packages energy counter
> +                 */
> +                if (thd_stat[j].numa_node_id == i) {
> +                    pkg_stat[i].e_end =
> +                       vmsr_read_msr(MSR_PKG_ENERGY_STATUS,
> +                                     thd_stat[j].cpu_id,
> +                                     thd_stat[j].thread_id,
> +                                     s->msr_energy.socket_path);
> +                    pkg_stat[i].e_delta =
> +                        pkg_stat[i].e_end - pkg_stat[i].e_start;
> +                    break;
> +                }
> +            }
> +        }
> +
> +        /* Delta of ticks spend by each thread between the sample */
> +        for (int i = 0; i < num_threads; i++) {
> +            if (vmsr_read_thread_stat(&thd_stat[i], pid, 1) != 0) {
> +                /*
> +                 * We don't count the dead thread
> +                 * i.e threads that existed before the sleep
> +                 * and not anymore
> +                 */
> +                thd_stat[i].delta_ticks = 0;
> +            } else {
> +                vmsr_delta_ticks(thd_stat, i);
> +            }
> +        }
> +
> +        /*
> +         * Identify the vCPU threads
> +         * Calculate the Number of vCPU per package
> +         */
> +        CPU_FOREACH(cpu) {
> +            for (int i = 0; i < num_threads; i++) {
> +                if (cpu->thread_id == thd_stat[i].thread_id) {
> +                    thd_stat[i].is_vcpu = true;
> +                    thd_stat[i].vcpu_id = cpu->cpu_index;
> +                    pkg_stat[thd_stat[i].numa_node_id].nb_vcpu++;
> +                    break;
> +                }
> +            }
> +        }
> +
> +        /* Calculate the total energy of all non-vCPU thread */
> +        for (int i = 0; i < num_threads; i++) {
> +            double temp;
> +            if ((thd_stat[i].is_vcpu != true) &&
> +                (thd_stat[i].delta_ticks > 0)) {
> +                temp = vmsr_get_ratio(pkg_stat, thd_stat, maxticks, i);
> +                pkg_stat[thd_stat[i].numa_node_id].e_ratio
> +                    += (uint64_t)lround(temp);
> +            }
> +        }
> +
> +        /* Calculate the ratio per non-vCPU thread of each package */
> +        for (int i = 0; i <= maxpkgs; i++) {
> +            if (pkg_stat[i].nb_vcpu > 0) {
> +                pkg_stat[i].e_ratio = pkg_stat[i].e_ratio / pkg_stat[i].nb_vcpu;
> +            }
> +        }
> +
> +        /* Calculate the energy for each vCPU thread */
> +        for (int i = 0; i < num_threads; i++) {
> +            double temp;
> +
> +            if ((thd_stat[i].is_vcpu == true) &&
> +                (thd_stat[i].delta_ticks > 0)) {
> +                temp = vmsr_get_ratio(pkg_stat, thd_stat, maxticks, i);
> +                vmsr->msr_value[thd_stat[i].vcpu_id] += (uint64_t)lround(temp);
> +                vmsr->msr_value[thd_stat[i].vcpu_id] \
> +                    += pkg_stat[thd_stat[i].numa_node_id].e_ratio;
> +            }
> +        }
> +
> +        /* free all memory */
> +        for (int i = 0; i < num_threads; i++) {
> +            memset(thd_stat[i].utime, 0, 2 * sizeof(unsigned long long));
> +            memset(thd_stat[i].stime, 0, 2 * sizeof(unsigned long long));
> +        }
> +        /* Zero out the memory */

This comment should be further up, as 'free all memory' is inaccurate

> +        memset(thd_stat, 0, num_threads * sizeof(thread_stat));
> +        memset(thread_ids, 0, sizeof(pid_t));
> +    }
> +
> +clean:
> +    /* free all memory */
> +    for (int i = 0; i < num_threads; i++) {
> +        g_free(thd_stat[i].utime);
> +        g_free(thd_stat[i].stime);
> +    }
> +    g_free(thd_stat);
> +    g_free(thread_ids);

Redundant if using g_autofree

> +
> +    rcu_unregister_thread();
> +    return NULL;
> +}
> +
> +static int kvm_msr_energy_thread_init(KVMState *s, MachineState *ms)
> +{
> +    struct KVMMsrEnergy *r = &s->msr_energy;
> +
> +    /* Retrieve the number of vCPU */
> +    r->cpus = ms->smp.cpus;
> +
> +    /* Allocate register memory (MSR_PKG_STATUS) for each vCPU */
> +    r->msr_value = calloc(r->cpus, sizeof(r->msr_value));

g_new0 please.

> +
> +    qemu_thread_create(&r->msr_thr, "kvm-msr",
> +                       kvm_msr_energy_thread,
> +                       s, QEMU_THREAD_JOINABLE);
> +
> +    return 0;
> +}
> +
>  int kvm_arch_get_default_type(MachineState *ms)
>  {
>      return 0;
> @@ -2774,6 +3042,46 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>                           strerror(-ret));
>              exit(1);
>          }
> +
> +        if (s->msr_energy.enable == true) {
> +
> +            r = kvm_filter_msr(s, MSR_RAPL_POWER_UNIT,
> +                               kvm_rdmsr_rapl_power_unit, NULL);
> +            if (!r) {
> +                error_report("Could not install MSR_RAPL_POWER_UNIT \
> +                                handler: %s",
> +                             strerror(-ret));
> +                exit(1);
> +            }
> +
> +            r = kvm_filter_msr(s, MSR_PKG_POWER_LIMIT,
> +                               kvm_rdmsr_pkg_power_limit, NULL);
> +            if (!r) {
> +                error_report("Could not install MSR_PKG_POWER_LIMIT \
> +                                handler: %s",
> +                             strerror(-ret));
> +                exit(1);
> +            }
> +
> +            r = kvm_filter_msr(s, MSR_PKG_POWER_INFO,
> +                               kvm_rdmsr_pkg_power_info, NULL);
> +            if (!r) {
> +                error_report("Could not install MSR_PKG_POWER_INFO \
> +                                handler: %s",
> +                             strerror(-ret));
> +                exit(1);
> +            }
> +            r = kvm_filter_msr(s, MSR_PKG_ENERGY_STATUS,
> +                               kvm_rdmsr_pkg_energy_status, NULL);
> +            if (!r) {
> +                error_report("Could not install MSR_PKG_ENERGY_STATUS \
> +                                handler: %s",
> +                             strerror(-ret));
> +                exit(1);
> +            } else {
> +                kvm_msr_energy_thread_init(s, ms);
> +            }
> +        }

Presumably something here should be validating that we're on an Intel
host, and using an Intel guest CPU model.
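
As a rough illustration of the vendor test, assuming the three register
values come from CPUID leaf 0 (on the host) or from the guest CPU model's
vendor words. The helper names are hypothetical, and how QEMU obtains the
registers is left out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: rebuild the 12-character vendor string from the
 * CPUID leaf 0 registers. The string is stored in EBX, EDX, ECX order,
 * least significant byte first.
 */
static void cpuid_vendor_string(uint32_t ebx, uint32_t edx, uint32_t ecx,
                                char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };

    assert(out != NULL);

    for (int r = 0; r < 3; r++) {
        for (int b = 0; b < 4; b++) {
            out[r * 4 + b] = (char)((regs[r] >> (8 * b)) & 0xff);
        }
    }
    out[12] = '\0';
}

/* Hypothetical helper: true only for the "GenuineIntel" vendor string */
static bool cpuid_vendor_is_intel(uint32_t ebx, uint32_t edx, uint32_t ecx)
{
    char vendor[13];

    cpuid_vendor_string(ebx, edx, ecx, vendor);
    return strcmp(vendor, "GenuineIntel") == 0;
}
```

The RAPL handlers would then only be installed when both the host and the
guest vendor pass such a check.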

>      }
>  
>      return 0;

> diff --git a/target/i386/kvm/vmsr_energy.c b/target/i386/kvm/vmsr_energy.c
> new file mode 100644
> index 000000000000..9f21ef2250a4
> --- /dev/null
> +++ b/target/i386/kvm/vmsr_energy.c
> @@ -0,0 +1,278 @@
> +/*
> + * QEMU KVM support -- x86 virtual energy-related MSR.
> + *
> + * Copyright 2023 Red Hat, Inc. 2023
> + *
> + *  Author:
> + *      Anthony Harivel <aharivel@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "vmsr_energy.h"
> +#include <stdint.h>
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "io/channel.h"
> +#include "io/channel-socket.h"
> +
> +#define MAX_PATH_LEN 50
> +#define MAX_LINE_LEN 500
> +
> +static char *compute_default_paths(void)
> +{
> +    g_autofree char *state = qemu_get_local_state_dir();
> +
> +    return g_build_filename(state, "run", "qemu-vmsr-helper.sock", NULL);
> +}
> +
> +static int vmsr_helper_socket_read(QIOChannel *ioc,
> +                                  void *buf, int sz, Error **errp)
> +{
> +    ssize_t r = qio_channel_read_all(ioc, buf, sz, errp);
> +
> +    if (r < 0) {
> +        object_unref(OBJECT(ioc));
> +        ioc = NULL;
> +        return -EINVAL;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vmsr_helper_socket_write(QIOChannel *ioc,
> +                                   int fd,
> +                                   const void *buf, int sz, Error **errp)
> +{
> +    size_t nfds = (fd != -1);
> +    while (sz > 0) {
> +        struct iovec iov;
> +        ssize_t n_written;
> +
> +        iov.iov_base = (void *)buf;
> +        iov.iov_len = sz;
> +        n_written = qio_channel_writev_full(QIO_CHANNEL(ioc), &iov, 1,
> +                                            nfds ? &fd : NULL, nfds, 0, errp);
> +
> +        if (n_written <= 0) {
> +            assert(n_written != QIO_CHANNEL_ERR_BLOCK);
> +            object_unref(OBJECT(ioc));
> +            ioc = NULL;
> +            return n_written < 0 ? -EINVAL : 0;
> +        }
> +
> +        nfds = 0;
> +        buf += n_written;
> +        sz -= n_written;
> +    }

There is only one caller, and it passes -1 for fd. So this entire
method can be deleted. Just call qio_channel_writev_all().


> +uint64_t vmsr_read_msr(uint32_t reg, unsigned int cpu_id, uint32_t tid,
> +                       const char *path)
> +{
> +    uint64_t data = 0;
> +    char *socket_path = NULL;

g_autofree

> +    unsigned int buffer[3];

The server side is expecting strictly 'uint32_t' sized data.
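
A small sketch of what a fixed-width request could look like, assuming the
wire format is exactly three 32-bit fields in host byte order (both ends run
on the same host). The helper name and the field-count macro are
hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire layout: register, CPU id, thread id */
#define VMSR_REQ_FIELDS 3

_Static_assert(sizeof(uint32_t) * VMSR_REQ_FIELDS == 12,
               "request must be exactly 12 bytes");

/*
 * Hypothetical helper: pack a request into 'out'.
 * Returns the number of bytes written, or 0 if 'out' is too small.
 */
static size_t vmsr_pack_request(uint8_t *out, size_t outlen,
                                uint32_t reg, uint32_t cpu_id, uint32_t tid)
{
    const uint32_t buffer[VMSR_REQ_FIELDS] = { reg, cpu_id, tid };

    assert(out != NULL);

    if (outlen < sizeof(buffer)) {
        return 0;
    }
    memcpy(out, buffer, sizeof(buffer));
    return sizeof(buffer);
}
```

Using uint32_t for the buffer keeps the request size independent of how wide
'unsigned int' happens to be on the build host.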

> +
> +    if (path == NULL) {
> +        socket_path = compute_default_paths();
> +    } else {
> +        socket_path = g_strdup(path);
> +    }
> +
> +    SocketAddress saddr = {
> +        .type = SOCKET_ADDRESS_TYPE_UNIX,
> +        .u.q_unix.path = socket_path
> +    };
> +    QIOChannelSocket *sioc = qio_channel_socket_new();
> +    Error *local_err = NULL;
> +
> +    int r;
> +
> +    qio_channel_set_name(QIO_CHANNEL(sioc), "vmsr-helper");
> +    qio_channel_socket_connect_sync(sioc,
> +                                    &saddr,
> +                                    &local_err);
> +    g_free(socket_path);
> +    if (local_err) {
> +        goto out_close;
> +    }
> +
> +    /*
> +     * Send the required arguments:
> +     * 1. RAPL MSR register to read
> +     * 2. On which CPU ID
> +     * 3. From which vCPU (Thread ID)
> +     */
> +    buffer[0] = reg;
> +    buffer[1] = cpu_id;
> +    buffer[2] = tid;
> +
> +    r = vmsr_helper_socket_write(QIO_CHANNEL(sioc),
> +                                 -1,
> +                                 &buffer, sizeof(buffer),
> +                                 &local_err);
> +    if (r < 0) {
> +        goto out_close;
> +    }
> +
> +    r = vmsr_helper_socket_read(QIO_CHANNEL(sioc),
> +                                &data, sizeof(data),
> +                                &local_err);
> +    if (r < 0) {
> +        data = 0;
> +        goto out_close;
> +    }
> +
> +out_close:
> +    /* Close socket. */
> +    qio_channel_close(QIO_CHANNEL(sioc), NULL);
> +    object_unref(OBJECT(sioc));
> +    return data;
> +}

> +/* Retrieve the maximum number of physical package */
> +unsigned int vmsr_get_max_physical_package(unsigned int max_cpus)
> +{
> +    unsigned int packageCount = 0;
> +    int *uniquePackages;

g_autofree int *uniquePackages = NULL;

> +
> +    char filePath[256];
> +    FILE *file;
> +
> +    uniquePackages = g_malloc0(max_cpus * sizeof(int));
> +
> +    for (int i = 0; ; i++) {
> +        snprintf(filePath, sizeof(filePath),
> +            "/sys/devices/system/cpu/cpu%d/topology/physical_package_id", i);

g_build_filename please.

> +        file = fopen(filePath, "r");
> +
> +        if (file == NULL) {
> +            break;
> +        }
> +
> +        char packageId[10];
> +        if (fgets(packageId, sizeof(packageId), file) == NULL) {
> +            packageCount = 0;
> +        }
> +        fclose(file);

Can be simplified, with the fixed buffer size removed:

   g_autofree char *packageId = NULL;
   GError *err = NULL;
   if (!g_file_get_contents(filePath, &packageId, NULL, &err)) {
       ....report something with 'err'....
   }

> +
> +        int currentPackageId = atoi(packageId);
> +
> +        bool isUnique = true;
> +        for (int j = 0; j < packageCount; j++) {
> +            if (uniquePackages[j] == currentPackageId) {
> +                isUnique = false;
> +                break;
> +            }
> +        }
> +
> +        if (isUnique) {
> +            uniquePackages[packageCount] = currentPackageId;
> +            packageCount++;
> +        }
> +    }
> +
> +    g_free(uniquePackages);

Remove with g_autofree

> +    return packageCount;
> +}
> +
> +int vmsr_read_thread_stat(struct thread_stat *thread, int pid, int index)
> +{
> +    char path[MAX_PATH_LEN];
> +    snprintf(path, MAX_PATH_LEN, "/proc/%u/task/%d/stat", pid, \
> +             thread->thread_id);

g_build_filename

> +
> +    FILE *file = fopen(path, "r");
> +    if (file == NULL) {
> +        return -1;
> +    }
> +
> +    if (fscanf(file, "%*d (%*[^)]) %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u"
> +        " %llu %llu %*d %*d %*d %*d %*d %*d %*u %*u %*d %*u %*u"
> +        " %*u %*u %*u %*u %*u %*u %*u %*u %*u %*d %*u %*u %u",
> +           &thread->utime[index], &thread->stime[index], &thread->cpu_id) != 3)
> +        return -1;
> +
> +    fclose(file);
> +    return 0;
> +}
> +
> +/* Read QEMU stat task folder to retrieve all QEMU threads ID */
> +pid_t *vmsr_get_thread_ids(pid_t pid, int *num_threads)
> +{
> +    char path[100];
> +    sprintf(path, "/proc/%d/task", pid);

g_build_filename

> +
> +    DIR *dir = opendir(path);
> +    if (dir == NULL) {
> +        perror("opendir");
> +        return NULL;
> +    }
> +
> +    pid_t *thread_ids = NULL;
> +    int thread_count = 0;
> +
> +    struct dirent *ent;
> +    while ((ent = readdir(dir)) != NULL) {
> +        if (ent->d_name[0] == '.') {
> +            continue;
> +        }
> +        pid_t tid = atoi(ent->d_name);
> +        if (pid != tid) {
> +            thread_ids = g_realloc(thread_ids,
> +                                 (thread_count + 1) * sizeof(pid_t));

g_renew

> +            thread_ids[thread_count] = tid;
> +            thread_count++;
> +        }
> +    }
> +
> +    closedir(dir);
> +
> +    *num_threads = thread_count;
> +    return thread_ids;
> +}

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v2 1/3] qio: add support for SO_PEERCRED for socket channel
  2023-11-01 10:20   ` Daniel P. Berrangé
@ 2023-11-01 14:23     ` Paolo Bonzini
  0 siblings, 0 replies; 11+ messages in thread
From: Paolo Bonzini @ 2023-11-01 14:23 UTC (permalink / raw)
  To: Daniel P. Berrangé, Anthony Harivel; +Cc: qemu-devel, mtosatti

On 11/1/23 11:20, Daniel P. Berrangé wrote:
> On Tue, Oct 31, 2023 at 03:46:01PM +0100, Anthony Harivel wrote:
>> The function qio_channel_get_peercred() returns a pointer to the
>> credentials of the peer process connected to this socket.
>>
>> This credentials structure is defined in <sys/socket.h> as follows:
>>
>> struct ucred {
>> 	pid_t pid;    /* Process ID of the sending process */
>> 	uid_t uid;    /* User ID of the sending process */
>> 	gid_t gid;    /* Group ID of the sending process */
>> };
>>
>> The use of this function is possible only for connected AF_UNIX stream
>> sockets and for AF_UNIX stream and datagram socket pairs.
>>
>> Signed-off-by: Anthony Harivel <aharivel@redhat.com>
>> ---
>>   include/io/channel.h | 20 ++++++++++++++++++++
>>   io/channel-socket.c  | 17 +++++++++++++++++
>>   io/channel.c         | 12 ++++++++++++
>>   3 files changed, 49 insertions(+)
>>
>> diff --git a/include/io/channel.h b/include/io/channel.h
>> index 5f9dbaab65b0..99c02d61c3d9 100644
>> --- a/include/io/channel.h
>> +++ b/include/io/channel.h
>> @@ -149,6 +149,9 @@ struct QIOChannelClass {
>>                                     void *opaque);
>>       int (*io_flush)(QIOChannel *ioc,
>>                       Error **errp);
>> +    void (*io_peercred)(QIOChannel *ioc,
>> +                        struct ucred *cred,
>> +                        Error **errp);
> 
> This isn't going to fly. 'struct ucred' is Linux specific, so this won't
> compile on macOS, Windows, *BSD, and we don't really want a huge #ifdef
> ladder in these APIs. This will need to explode the struct and return
> the individual fields that are present instead, and the impl side must
> compile on other OS, even if its just stubbed out to return an error.

I would further reduce it to io_peerpid, because the BSDs can only
provide the peer uid and gid.
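
Something along these lines, as a sketch only. The function name is
hypothetical, and the non-Linux branch is simply stubbed out to return an
error:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for struct ucred on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: store the PID of the peer connected to 'fd'.
 * Returns 0 on success, -1 with errno set on failure or when the
 * platform cannot report a peer PID.
 */
static int peer_pid_of_socket(int fd, pid_t *pid)
{
    assert(pid != NULL);

#if defined(__linux__) && defined(SO_PEERCRED)
    struct ucred cred;
    socklen_t len = sizeof(cred);

    if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0) {
        return -1;
    }
    *pid = cred.pid;
    return 0;
#else
    (void)fd;
    errno = ENOTSUP;
    return -1;
#endif
}
```

Only the PID leaves the helper, so no caller ever needs to see
'struct ucred'.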

Paolo




* Re: [PATCH v2 2/3] tools: build qemu-vmsr-helper
  2023-10-31 14:46 ` [PATCH v2 2/3] tools: build qemu-vmsr-helper Anthony Harivel
  2023-11-01 10:36   ` Daniel P. Berrangé
  2023-11-01 10:52   ` Daniel P. Berrangé
@ 2023-11-01 14:32   ` Paolo Bonzini
  2 siblings, 0 replies; 11+ messages in thread
From: Paolo Bonzini @ 2023-11-01 14:32 UTC (permalink / raw)
  To: Anthony Harivel, qemu-devel; +Cc: mtosatti, berrange

On 10/31/23 15:46, Anthony Harivel wrote:
> +
> +static uint64_t vmsr_read_msr(uint32_t reg, unsigned int cpu_id)
> +{
> +    int fd;
> +    uint64_t data;
> +
> +    char path[MAX_PATH_LEN];
> +    snprintf(path, MAX_PATH_LEN, "/dev/cpu/%u/msr", cpu_id);

If you allow any CPU here, the thread id is really unused.  You can 
however call sched_getaffinity(), and check that the CPU id is included 
in the thread's affinity.  sched_getaffinity() does not need any extra 
capability.
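
A sketch of such a check, assuming Linux and a fixed-size cpu_set_t (which
covers hosts with up to CPU_SETSIZE CPUs). The helper name is hypothetical:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for cpu_set_t and the CPU_* macros */
#endif
#include <assert.h>
#include <sched.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical helper: true if 'cpu_id' is part of the affinity mask of
 * thread 'tid' (0 means the calling thread). Any lookup failure is
 * treated as "not allowed".
 */
static bool tid_may_run_on_cpu(pid_t tid, unsigned int cpu_id)
{
    cpu_set_t set;

    assert(tid >= 0);

    if (cpu_id >= CPU_SETSIZE) {
        return false;
    }

    CPU_ZERO(&set);
    if (sched_getaffinity(tid, sizeof(set), &set) != 0) {
        return false;
    }
    return CPU_ISSET(cpu_id, &set);
}
```

The helper would refuse the request when this returns false for the
TID/CPU pair in the message.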

Paolo




* Re: [PATCH v2 3/3] Add support for RAPL MSRs in KVM/Qemu
  2023-10-31 14:46 ` [PATCH v2 3/3] Add support for RAPL MSRs in KVM/Qemu Anthony Harivel
  2023-11-01 11:24   ` Daniel P. Berrangé
@ 2023-11-01 14:33   ` Paolo Bonzini
  1 sibling, 0 replies; 11+ messages in thread
From: Paolo Bonzini @ 2023-11-01 14:33 UTC (permalink / raw)
  To: Anthony Harivel, qemu-devel; +Cc: mtosatti, berrange

On 10/31/23 15:46, Anthony Harivel wrote:
> +    /* Get QEMU PID*/
> +    pid = getpid();

This should be gettid(), or perhaps a VCPU thread's TID.

> 
> +    /* Those MSR values should not change as well */
> +    vmsr->msr_unit  = vmsr_read_msr(MSR_RAPL_POWER_UNIT, 0, pid,
> +                                    s->msr_energy.socket_path);

The "0" should be sched_getcpu().

I'll review later the way that the measuring thread is created and 
operates, as it's a holiday today here. :)

Paolo

> +    vmsr->msr_limit = vmsr_read_msr(MSR_PKG_POWER_LIMIT, 0, pid,
> +                                    s->msr_energy.socket_path);
> +    vmsr->msr_info  = vmsr_read_msr(MSR_PKG_POWER_INFO, 0, pid,
> +                                    s->msr_energy.socket_path);
> +





end of thread, other threads:[~2023-11-01 14:34 UTC | newest]

Thread overview: 11+ messages
2023-10-31 14:46 [PATCH v2 0/3] Add support for RAPL MSRs series Anthony Harivel
2023-10-31 14:46 ` [PATCH v2 1/3] qio: add support for SO_PEERCRED for socket channel Anthony Harivel
2023-11-01 10:20   ` Daniel P. Berrangé
2023-11-01 14:23     ` Paolo Bonzini
2023-10-31 14:46 ` [PATCH v2 2/3] tools: build qemu-vmsr-helper Anthony Harivel
2023-11-01 10:36   ` Daniel P. Berrangé
2023-11-01 10:52   ` Daniel P. Berrangé
2023-11-01 14:32   ` Paolo Bonzini
2023-10-31 14:46 ` [PATCH v2 3/3] Add support for RAPL MSRs in KVM/Qemu Anthony Harivel
2023-11-01 11:24   ` Daniel P. Berrangé
2023-11-01 14:33   ` Paolo Bonzini
