[PATCH v3 0/3] Add support for the RAPL MSRs series

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v3 0/3] Add support for the RAPL MSRs series
@ 2024-01-25  7:22 Anthony Harivel
  2024-01-25  7:22 ` [PATCH v3 1/3] qio: add support for SO_PEERCRED for socket channel Anthony Harivel
                   ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Anthony Harivel @ 2024-01-25  7:22 UTC (permalink / raw)
  To: pbonzini, mtosatti, berrange; +Cc: qemu-devel, vchundur, Anthony Harivel

Dear maintainers,

First of all, thank you very much for your recent review of my patch 
[1].

In this version (v3), I have attempted to address the most crucial and 
challenging aspect highlighted in your last review.

I am hopeful that we can now engage in a discussion and address the 
remaining potential points that need attention.

Thank you for your continued guidance.

v2 -> v3
--------

- Move all memory allocations from Clib to Glib

- Compile on *BSD (working on Linux only)

- No more limitation on the virtual package: each vCPU that belongs to 
  the same virtual package is giving the same results like expected on 
  a real CPU.
  This has been tested topology like:
     -smp 4,sockets=2
     -smp 16,sockets=4,cores=2,threads=2

v1 -> v2
--------

- To overcome the CVE-2020-8694 a socket communication is created
  to a priviliged helper

- Add the priviliged helper (qemu-vmsr-helper)

- Add SO_PEERCRED in qio channel socket

RFC -> v1
---------

- Add vmsr_* in front of all vmsr specific function

- Change malloc()/calloc()... with all glib equivalent

- Pre-allocate all dynamic memories when possible

- Add a Documentation of implementation, limitation and usage

Best regards,
Anthony

[1]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg1003382.html

Anthony Harivel (3):
  qio: add support for SO_PEERCRED for socket channel
  tools: build qemu-vmsr-helper
  Add support for RAPL MSRs in KVM/Qemu

 accel/kvm/kvm-all.c                      |  27 ++
 contrib/systemd/qemu-vmsr-helper.service |  15 +
 contrib/systemd/qemu-vmsr-helper.socket  |   9 +
 docs/specs/index.rst                     |   1 +
 docs/specs/rapl-msr.rst                  | 133 ++++++
 docs/tools/index.rst                     |   1 +
 docs/tools/qemu-vmsr-helper.rst          |  89 ++++
 include/io/channel.h                     |  21 +
 include/sysemu/kvm_int.h                 |  17 +
 io/channel-socket.c                      |  23 +
 io/channel.c                             |  12 +
 meson.build                              |   5 +
 target/i386/cpu.h                        |   8 +
 target/i386/kvm/kvm.c                    | 348 ++++++++++++++++
 target/i386/kvm/meson.build              |   1 +
 target/i386/kvm/vmsr_energy.c            | 295 +++++++++++++
 target/i386/kvm/vmsr_energy.h            |  87 ++++
 tools/i386/qemu-vmsr-helper.c            | 507 +++++++++++++++++++++++
 tools/i386/rapl-msr-index.h              |  28 ++
 19 files changed, 1627 insertions(+)
 create mode 100644 contrib/systemd/qemu-vmsr-helper.service
 create mode 100644 contrib/systemd/qemu-vmsr-helper.socket
 create mode 100644 docs/specs/rapl-msr.rst
 create mode 100644 docs/tools/qemu-vmsr-helper.rst
 create mode 100644 target/i386/kvm/vmsr_energy.c
 create mode 100644 target/i386/kvm/vmsr_energy.h
 create mode 100644 tools/i386/qemu-vmsr-helper.c
 create mode 100644 tools/i386/rapl-msr-index.h

-- 
2.43.0



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 1/3] qio: add support for SO_PEERCRED for socket channel
  2024-01-25  7:22 [PATCH v3 0/3] Add support for the RAPL MSRs series Anthony Harivel
@ 2024-01-25  7:22 ` Anthony Harivel
  2024-01-25 16:37   ` Daniel P. Berrangé
  2024-01-25  7:22 ` [PATCH v3 2/3] tools: build qemu-vmsr-helper Anthony Harivel
  2024-01-25  7:22 ` [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu Anthony Harivel
  2 siblings, 1 reply; 32+ messages in thread
From: Anthony Harivel @ 2024-01-25  7:22 UTC (permalink / raw)
  To: pbonzini, mtosatti, berrange; +Cc: qemu-devel, vchundur, Anthony Harivel

The function qio_channel_get_peercred() returns a pointer to the
credentials of the peer process connected to this socket.

This credentials structure is defined in <sys/socket.h> as follows:

struct ucred {
	pid_t pid;    /* Process ID of the sending process */
	uid_t uid;    /* User ID of the sending process */
	gid_t gid;    /* Group ID of the sending process */
};

The use of this function is possible only for connected AF_UNIX stream
sockets and for AF_UNIX stream and datagram socket pairs.

On platform other than Linux, the function return 0.

Signed-off-by: Anthony Harivel <aharivel@redhat.com>
---
 include/io/channel.h | 21 +++++++++++++++++++++
 io/channel-socket.c  | 23 +++++++++++++++++++++++
 io/channel.c         | 12 ++++++++++++
 3 files changed, 56 insertions(+)

diff --git a/include/io/channel.h b/include/io/channel.h
index 5f9dbaab65b0..0413435ce011 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -149,6 +149,9 @@ struct QIOChannelClass {
                                   void *opaque);
     int (*io_flush)(QIOChannel *ioc,
                     Error **errp);
+    void (*io_peerpid)(QIOChannel *ioc,
+                        unsigned int *pid,
+                        Error **errp);
 };
 
 /* General I/O handling functions */
@@ -898,4 +901,22 @@ int coroutine_mixed_fn qio_channel_writev_full_all(QIOChannel *ioc,
 int qio_channel_flush(QIOChannel *ioc,
                       Error **errp);
 
+/**
+ * qio_channel_get_peercred:
+ * @ioc: the channel object
+ * @pid: pointer to pid
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Returns the pid of the peer process connected to this socket.
+ *
+ * The use of this function is possible only for connected
+ * AF_UNIX stream sockets and for AF_UNIX stream and datagram
+ * socket pairs on Linux.
+ * Return 0 for the non-Linux OS.
+ *
+ */
+void qio_channel_get_peerpid(QIOChannel *ioc,
+                              unsigned int *pid,
+                              Error **errp);
+
 #endif /* QIO_CHANNEL_H */
diff --git a/io/channel-socket.c b/io/channel-socket.c
index 3a899b060858..e6a73592650c 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -841,6 +841,28 @@ qio_channel_socket_set_cork(QIOChannel *ioc,
     socket_set_cork(sioc->fd, v);
 }
 
+static void
+qio_channel_socket_get_peerpid(QIOChannel *ioc,
+                               unsigned int *pid,
+                               Error **errp)
+{
+#ifdef CONFIG_LINUX
+    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+    Error *err = NULL;
+    socklen_t len = sizeof(struct ucred);
+
+    struct ucred cred;
+    if (getsockopt(sioc->fd,
+               SOL_SOCKET, SO_PEERCRED,
+               &cred, &len) == -1) {
+        error_setg_errno(&err, errno, "Unable to get peer credentials");
+        error_propagate(errp, err);
+    }
+    *pid = (unsigned int)cred.pid;
+#else
+    *pid = 0;
+#endif
+}
 
 static int
 qio_channel_socket_close(QIOChannel *ioc,
@@ -938,6 +960,7 @@ static void qio_channel_socket_class_init(ObjectClass *klass,
 #ifdef QEMU_MSG_ZEROCOPY
     ioc_klass->io_flush = qio_channel_socket_flush;
 #endif
+    ioc_klass->io_peerpid = qio_channel_socket_get_peerpid;
 }
 
 static const TypeInfo qio_channel_socket_info = {
diff --git a/io/channel.c b/io/channel.c
index 86c5834510ff..a5646650cf72 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -490,6 +490,18 @@ void qio_channel_set_cork(QIOChannel *ioc,
     }
 }
 
+void qio_channel_get_peerpid(QIOChannel *ioc,
+                             unsigned int *pid,
+                             Error **errp)
+{
+    QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+    if (!klass->io_peerpid) {
+        error_setg(errp, "Channel does not support peer pid");
+        return;
+    }
+    klass->io_peerpid(ioc, pid, errp);
+}
 
 off_t qio_channel_io_seek(QIOChannel *ioc,
                           off_t offset,
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/3] qio: add support for SO_PEERCRED for socket channel
  2024-01-25  7:22 ` [PATCH v3 1/3] qio: add support for SO_PEERCRED for socket channel Anthony Harivel
@ 2024-01-25 16:37   ` Daniel P. Berrangé
  2024-01-29 19:25     ` Paolo Bonzini
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel P. Berrangé @ 2024-01-25 16:37 UTC (permalink / raw)
  To: Anthony Harivel; +Cc: pbonzini, mtosatti, qemu-devel, vchundur

On Thu, Jan 25, 2024 at 08:22:12AM +0100, Anthony Harivel wrote:
> The function qio_channel_get_peercred() returns a pointer to the
> credentials of the peer process connected to this socket.
> 
> This credentials structure is defined in <sys/socket.h> as follows:
> 
> struct ucred {
> 	pid_t pid;    /* Process ID of the sending process */
> 	uid_t uid;    /* User ID of the sending process */
> 	gid_t gid;    /* Group ID of the sending process */
> };
> 
> The use of this function is possible only for connected AF_UNIX stream
> sockets and for AF_UNIX stream and datagram socket pairs.
> 
> On platform other than Linux, the function return 0.
> 
> Signed-off-by: Anthony Harivel <aharivel@redhat.com>
> ---
>  include/io/channel.h | 21 +++++++++++++++++++++
>  io/channel-socket.c  | 23 +++++++++++++++++++++++
>  io/channel.c         | 12 ++++++++++++
>  3 files changed, 56 insertions(+)
> 
> diff --git a/include/io/channel.h b/include/io/channel.h
> index 5f9dbaab65b0..0413435ce011 100644
> --- a/include/io/channel.h
> +++ b/include/io/channel.h
> @@ -149,6 +149,9 @@ struct QIOChannelClass {
>                                    void *opaque);
>      int (*io_flush)(QIOChannel *ioc,
>                      Error **errp);
> +    void (*io_peerpid)(QIOChannel *ioc,
> +                        unsigned int *pid,
> +                        Error **errp);

Idented 1 space too many

>  };
>  
>  /* General I/O handling functions */
> @@ -898,4 +901,22 @@ int coroutine_mixed_fn qio_channel_writev_full_all(QIOChannel *ioc,
>  int qio_channel_flush(QIOChannel *ioc,
>                        Error **errp);
>  
> +/**
> + * qio_channel_get_peercred:
> + * @ioc: the channel object
> + * @pid: pointer to pid
> + * @errp: pointer to a NULL-initialized error object
> + *
> + * Returns the pid of the peer process connected to this socket.
> + *
> + * The use of this function is possible only for connected
> + * AF_UNIX stream sockets and for AF_UNIX stream and datagram
> + * socket pairs on Linux.
> + * Return 0 for the non-Linux OS.

Update to say this returns an error for other platforms

> + *
> + */
> +void qio_channel_get_peerpid(QIOChannel *ioc,
> +                              unsigned int *pid,
> +                              Error **errp);

Idented 1 space too many

> +
>  #endif /* QIO_CHANNEL_H */
> diff --git a/io/channel-socket.c b/io/channel-socket.c
> index 3a899b060858..e6a73592650c 100644
> --- a/io/channel-socket.c
> +++ b/io/channel-socket.c
> @@ -841,6 +841,28 @@ qio_channel_socket_set_cork(QIOChannel *ioc,
>      socket_set_cork(sioc->fd, v);
>  }
>  
> +static void
> +qio_channel_socket_get_peerpid(QIOChannel *ioc,
> +                               unsigned int *pid,
> +                               Error **errp)
> +{
> +#ifdef CONFIG_LINUX
> +    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
> +    Error *err = NULL;
> +    socklen_t len = sizeof(struct ucred);
> +
> +    struct ucred cred;
> +    if (getsockopt(sioc->fd,
> +               SOL_SOCKET, SO_PEERCRED,
> +               &cred, &len) == -1) {
> +        error_setg_errno(&err, errno, "Unable to get peer credentials");
> +        error_propagate(errp, err);
> +    }
> +    *pid = (unsigned int)cred.pid;
> +#else
> +    *pid = 0;

Defaulting 'pid' to 0 is potentially unsafe, because to a caller it
now appears that the remote party is 'root' and thus implied to be
a privileged account.

We should 'error_setg' for any platform we don't have an impl on,
so the caller will see a fail for any usage.

> +#endif
> +}
>  
>  static int
>  qio_channel_socket_close(QIOChannel *ioc,
> @@ -938,6 +960,7 @@ static void qio_channel_socket_class_init(ObjectClass *klass,
>  #ifdef QEMU_MSG_ZEROCOPY
>      ioc_klass->io_flush = qio_channel_socket_flush;
>  #endif
> +    ioc_klass->io_peerpid = qio_channel_socket_get_peerpid;
>  }
>  
>  static const TypeInfo qio_channel_socket_info = {
> diff --git a/io/channel.c b/io/channel.c
> index 86c5834510ff..a5646650cf72 100644
> --- a/io/channel.c
> +++ b/io/channel.c
> @@ -490,6 +490,18 @@ void qio_channel_set_cork(QIOChannel *ioc,
>      }
>  }
>  
> +void qio_channel_get_peerpid(QIOChannel *ioc,
> +                             unsigned int *pid,
> +                             Error **errp)
> +{
> +    QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
> +
> +    if (!klass->io_peerpid) {
> +        error_setg(errp, "Channel does not support peer pid");
> +        return;
> +    }
> +    klass->io_peerpid(ioc, pid, errp);
> +}
>  
>  off_t qio_channel_io_seek(QIOChannel *ioc,
>                            off_t offset,

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/3] qio: add support for SO_PEERCRED for socket channel
  2024-01-25 16:37   ` Daniel P. Berrangé
@ 2024-01-29 19:25     ` Paolo Bonzini
  2024-01-29 19:30       ` Daniel P. Berrangé
  0 siblings, 1 reply; 32+ messages in thread
From: Paolo Bonzini @ 2024-01-29 19:25 UTC (permalink / raw)
  To: Daniel P. Berrangé; +Cc: Anthony Harivel, mtosatti, qemu-devel, vchundur

On Thu, Jan 25, 2024 at 5:38 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > +static void
> > +qio_channel_socket_get_peerpid(QIOChannel *ioc,
> > +                               unsigned int *pid,
> > +                               Error **errp)
> > +{
> > +#ifdef CONFIG_LINUX
> > +    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
> > +    Error *err = NULL;
> > +    socklen_t len = sizeof(struct ucred);
> > +
> > +    struct ucred cred;
> > +    if (getsockopt(sioc->fd,
> > +               SOL_SOCKET, SO_PEERCRED,
> > +               &cred, &len) == -1) {
> > +        error_setg_errno(&err, errno, "Unable to get peer credentials");
> > +        error_propagate(errp, err);
> > +    }
> > +    *pid = (unsigned int)cred.pid;
> > +#else
> > +    *pid = 0;
>
> Defaulting 'pid' to 0 is potentially unsafe, because to a caller it
> now appears that the remote party is 'root' and thus implied to be
> a privileged account.

This is a pid, so 0 cannot be confused; however, I agree that
returning an error is better.

Paolo



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/3] qio: add support for SO_PEERCRED for socket channel
  2024-01-29 19:25     ` Paolo Bonzini
@ 2024-01-29 19:30       ` Daniel P. Berrangé
  0 siblings, 0 replies; 32+ messages in thread
From: Daniel P. Berrangé @ 2024-01-29 19:30 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Anthony Harivel, mtosatti, qemu-devel, vchundur

On Mon, Jan 29, 2024 at 08:25:29PM +0100, Paolo Bonzini wrote:
> On Thu, Jan 25, 2024 at 5:38 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > +static void
> > > +qio_channel_socket_get_peerpid(QIOChannel *ioc,
> > > +                               unsigned int *pid,
> > > +                               Error **errp)
> > > +{
> > > +#ifdef CONFIG_LINUX
> > > +    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
> > > +    Error *err = NULL;
> > > +    socklen_t len = sizeof(struct ucred);
> > > +
> > > +    struct ucred cred;
> > > +    if (getsockopt(sioc->fd,
> > > +               SOL_SOCKET, SO_PEERCRED,
> > > +               &cred, &len) == -1) {
> > > +        error_setg_errno(&err, errno, "Unable to get peer credentials");
> > > +        error_propagate(errp, err);
> > > +    }
> > > +    *pid = (unsigned int)cred.pid;
> > > +#else
> > > +    *pid = 0;
> >
> > Defaulting 'pid' to 0 is potentially unsafe, because to a caller it
> > now appears that the remote party is 'root' and thus implied to be
> > a privileged account.
> 
> This is a pid, so 0 cannot be confused; however, I agree that
> returning an error is better.

Opps, face-palm  !



With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 2/3] tools: build qemu-vmsr-helper
  2024-01-25  7:22 [PATCH v3 0/3] Add support for the RAPL MSRs series Anthony Harivel
  2024-01-25  7:22 ` [PATCH v3 1/3] qio: add support for SO_PEERCRED for socket channel Anthony Harivel
@ 2024-01-25  7:22 ` Anthony Harivel
  2024-01-29 18:53   ` Daniel P. Berrangé
  2024-01-25  7:22 ` [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu Anthony Harivel
  2 siblings, 1 reply; 32+ messages in thread
From: Anthony Harivel @ 2024-01-25  7:22 UTC (permalink / raw)
  To: pbonzini, mtosatti, berrange; +Cc: qemu-devel, vchundur, Anthony Harivel

Introduce a privileged helper to access RAPL MSR.

The privileged helper tool, qemu-vmsr-helper, is designed to provide
virtual machines with the ability to read specific RAPL (Running Average
Power Limit) MSRs without requiring CAP_SYS_RAWIO privileges or relying
on external, out-of-tree patches.

The helper tool leverages Unix permissions and SO_PEERCRED socket
options to enforce access control, ensuring that only processes
explicitly requesting read access via readmsr() from a valid Thread ID
can access these MSRs.

The list of RAPL MSRs that are allowed to be read by the helper tool is
defined in rapl-msr-index.h. This list corresponds to the RAPL MSRs that
will be supported in the next commit titled "Add support for RAPL MSRs
in KVM/QEMU."

The tool is intentionally designed to run on the Linux x86 platform.
This initial implementation is tailored for Intel CPUs but can be
extended to support AMD CPUs in the future.

Signed-off-by: Anthony Harivel <aharivel@redhat.com>
---
 contrib/systemd/qemu-vmsr-helper.service |  15 +
 contrib/systemd/qemu-vmsr-helper.socket  |   9 +
 docs/tools/index.rst                     |   1 +
 docs/tools/qemu-vmsr-helper.rst          |  89 ++++
 meson.build                              |   5 +
 tools/i386/qemu-vmsr-helper.c            | 507 +++++++++++++++++++++++
 tools/i386/rapl-msr-index.h              |  28 ++
 7 files changed, 654 insertions(+)
 create mode 100644 contrib/systemd/qemu-vmsr-helper.service
 create mode 100644 contrib/systemd/qemu-vmsr-helper.socket
 create mode 100644 docs/tools/qemu-vmsr-helper.rst
 create mode 100644 tools/i386/qemu-vmsr-helper.c
 create mode 100644 tools/i386/rapl-msr-index.h

diff --git a/contrib/systemd/qemu-vmsr-helper.service b/contrib/systemd/qemu-vmsr-helper.service
new file mode 100644
index 000000000000..8fd397bf79a9
--- /dev/null
+++ b/contrib/systemd/qemu-vmsr-helper.service
@@ -0,0 +1,15 @@
+[Unit]
+Description=Virtual RAPL MSR Daemon for QEMU
+
+[Service]
+WorkingDirectory=/tmp
+Type=simple
+ExecStart=/usr/bin/qemu-vmsr-helper
+PrivateTmp=yes
+ProtectSystem=strict
+ReadWritePaths=/var/run
+RestrictAddressFamilies=AF_UNIX
+Restart=always
+RestartSec=0
+
+[Install]
diff --git a/contrib/systemd/qemu-vmsr-helper.socket b/contrib/systemd/qemu-vmsr-helper.socket
new file mode 100644
index 000000000000..183e8304d6e2
--- /dev/null
+++ b/contrib/systemd/qemu-vmsr-helper.socket
@@ -0,0 +1,9 @@
+[Unit]
+Description=Virtual RAPL MSR helper for QEMU
+
+[Socket]
+ListenStream=/run/qemu-vmsr-helper.sock
+SocketMode=0600
+
+[Install]
+WantedBy=multi-user.target
diff --git a/docs/tools/index.rst b/docs/tools/index.rst
index 8e65ce0dfc7b..33ad438e86f6 100644
--- a/docs/tools/index.rst
+++ b/docs/tools/index.rst
@@ -16,3 +16,4 @@ command line utilities and other standalone programs.
    qemu-pr-helper
    qemu-trace-stap
    virtfs-proxy-helper
+   qemu-vmsr-helper
diff --git a/docs/tools/qemu-vmsr-helper.rst b/docs/tools/qemu-vmsr-helper.rst
new file mode 100644
index 000000000000..6ec87b49d962
--- /dev/null
+++ b/docs/tools/qemu-vmsr-helper.rst
@@ -0,0 +1,89 @@
+==================================
+QEMU virtual RAPL MSR helper
+==================================
+
+Synopsis
+--------
+
+**qemu-vmsr-helper** [*OPTION*]
+
+Description
+-----------
+
+Implements the virtual RAPL MSR helper for QEMU.
+
+Accessing the RAPL (Running Average Power Limit) MSR enables the RAPL powercap
+driver to advertise and monitor the power consumption or accumulated energy
+consumption of different power domains, such as CPU packages, DRAM, and other
+components when available.
+
+However those register are accesible under priviliged access (CAP_SYS_RAWIO).
+QEMU can use an external helper to access those priviliged register.
+
+:program:`qemu-vmsr-helper` is that external helper; it creates a listener
+socket which will accept incoming connections for communication with QEMU.
+
+If you want to run VMs in a setup like this, this helper should be started as a
+system service, and you should read the QEMU manual section on "RAPL MSR
+support" to find out how to configure QEMU to connect to the socket created by
+:program:`qemu-vmsr-helper`.
+
+After connecting to the socket, :program:`qemu-vmsr-helper` can
+optionally drop root privileges, except for those capabilities that
+are needed for its operation.
+
+:program:`qemu-vmsr-helper` can also use the systemd socket activation
+protocol.  In this case, the systemd socket unit should specify a
+Unix stream socket, like this::
+
+    [Socket]
+    ListenStream=/var/run/qemu-vmsr-helper.sock
+
+Options
+-------
+
+.. program:: qemu-vmsr-helper
+
+.. option:: -d, --daemon
+
+  run in the background (and create a PID file)
+
+.. option:: -q, --quiet
+
+  decrease verbosity
+
+.. option:: -v, --verbose
+
+  increase verbosity
+
+.. option:: -f, --pidfile=PATH
+
+  PID file when running as a daemon. By default the PID file
+  is created in the system runtime state directory, for example
+  :file:`/var/run/qemu-vmsr-helper.pid`.
+
+.. option:: -k, --socket=PATH
+
+  path to the socket. By default the socket is created in
+  the system runtime state directory, for example
+  :file:`/var/run/qemu-vmsr-helper.sock`.
+
+.. option:: -T, --trace [[enable=]PATTERN][,events=FILE][,file=FILE]
+
+  .. include:: ../qemu-option-trace.rst.inc
+
+.. option:: -u, --user=USER
+
+  user to drop privileges to
+
+.. option:: -g, --group=GROUP
+
+  group to drop privileges to
+
+.. option:: -h, --help
+
+  Display a help message and exit.
+
+.. option:: -V, --version
+
+  Display version information and exit.
diff --git a/meson.build b/meson.build
index d0329966f1b4..93fc233b0891 100644
--- a/meson.build
+++ b/meson.build
@@ -4015,6 +4015,11 @@ if have_tools
                dependencies: [authz, crypto, io, qom, qemuutil,
                               libcap_ng, mpathpersist],
                install: true)
+
+    executable('qemu-vmsr-helper', files('tools/i386/qemu-vmsr-helper.c'),
+               dependencies: [authz, crypto, io, qom, qemuutil,
+                              libcap_ng, mpathpersist],
+               install: true)
   endif
 
   if have_ivshmem
diff --git a/tools/i386/qemu-vmsr-helper.c b/tools/i386/qemu-vmsr-helper.c
new file mode 100644
index 000000000000..cf7d09bfcab3
--- /dev/null
+++ b/tools/i386/qemu-vmsr-helper.c
@@ -0,0 +1,507 @@
+/*
+ * Privileged RAPL MSR helper commands for QEMU
+ *
+ * Copyright (C) 2023 Red Hat, Inc. <aharivel@redhat.com>
+ *
+ * Author: Anthony Harivel <aharivel@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; under version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include <getopt.h>
+#include <stdbool.h>
+#include <sys/ioctl.h>
+#ifdef CONFIG_LIBCAP_NG
+#include <cap-ng.h>
+#endif
+#include <pwd.h>
+#include <grp.h>
+
+#include "qemu/help-texts.h"
+#include "qapi/error.h"
+#include "qemu/cutils.h"
+#include "qemu/main-loop.h"
+#include "qemu/module.h"
+#include "qemu/error-report.h"
+#include "qemu/config-file.h"
+#include "qemu-version.h"
+#include "qapi/error.h"
+#include "qemu/error-report.h"
+#include "qemu/log.h"
+#include "qemu/systemd.h"
+#include "qapi/util.h"
+#include "io/channel.h"
+#include "io/channel-socket.h"
+#include "trace/control.h"
+#include "qemu-version.h"
+#include "rapl-msr-index.h"
+
+#define MAX_PATH_LEN 256
+#define MSR_PATH_TEMPLATE "/dev/cpu/%u/msr"
+
+static char *socket_path;
+static char *pidfile;
+static enum { RUNNING, TERMINATE, TERMINATING } state;
+static QIOChannelSocket *server_ioc;
+static int server_watch;
+static int num_active_sockets = 1;
+
+#ifdef CONFIG_LIBCAP_NG
+static int uid = -1;
+static int gid = -1;
+#endif
+
+static void compute_default_paths(void)
+{
+    socket_path = g_build_filename("/run", "qemu-vmsr-helper.sock", NULL);
+    pidfile = g_build_filename("/run", "qemu-vmsr-helper.pid", NULL);
+}
+
+/*
+ * Check if the TID that request the MSR read
+ * belongs to the peer. It be should a TID of a vCPU.
+ */
+static bool is_tid_present(pid_t pid, pid_t tid)
+{
+    g_autofree char *pidStr;
+    g_autofree char *tidStr;
+
+    pidStr = g_strdup_printf("%d", pid);
+    tidStr = g_strdup_printf("%d", tid);
+
+    char *tidPath;
+
+    tidPath = g_strdup_printf("/proc/%s/task/%s", pidStr, tidStr);
+
+    /* Check if the TID directory exists within the PID directory */
+    if (access(tidPath, F_OK) == 0) {
+        return true;
+    }
+    error_report("Failed to open /proc at %s", tidPath);
+    return false;
+}
+
+/*
+ * Only the RAPL MSR in target/i386/cpu.h are allowed
+ */
+static bool is_msr_allowed(uint32_t reg)
+{
+    switch (reg) {
+    case MSR_RAPL_POWER_UNIT:
+    case MSR_PKG_POWER_LIMIT:
+    case MSR_PKG_ENERGY_STATUS:
+    case MSR_PKG_POWER_INFO:
+        return true;
+    default:
+        return false;
+    }
+}
+
+static uint64_t vmsr_read_msr(uint32_t msr_register, unsigned int cpu_id)
+{
+    int fd;
+    uint64_t result = 0;
+
+    g_autofree char *path;
+    path = g_new0(char, MAX_PATH_LEN);
+    path = g_strdup_printf(MSR_PATH_TEMPLATE, cpu_id);
+
+    /*
+     * Check if the specified CPU is included in the thread's affinity
+     */
+    cpu_set_t cpu_set;
+    CPU_ZERO(&cpu_set);
+    sched_getaffinity(0, sizeof(cpu_set_t), &cpu_set);
+
+    if (!CPU_ISSET(cpu_id, &cpu_set)) {
+        fprintf(stderr, "CPU %u is not in the thread's affinity.\n", cpu_id);
+        return result;
+    }
+
+    fd = open(path, O_RDONLY);
+    if (fd < 0) {
+        error_report("Failed to open MSR file at %s", path);
+        return result;
+    }
+
+    if (pread(fd, &result, sizeof(result), msr_register) != sizeof(result)) {
+        error_report("Failed to read MSR");
+        result = 0;
+    }
+
+    close(fd);
+    return result;
+}
+
+static void usage(const char *name)
+{
+    (printf) (
+"Usage: %s [OPTIONS] FILE\n"
+"Virtual RAPL MSR helper program for QEMU\n"
+"\n"
+"  -h, --help                display this help and exit\n"
+"  -V, --version             output version information and exit\n"
+"\n"
+"  -d, --daemon              run in the background\n"
+"  -f, --pidfile=PATH        PID file when running as a daemon\n"
+"                            (default '%s')\n"
+"  -k, --socket=PATH         path to the unix socket\n"
+"                            (default '%s')\n"
+"  -T, --trace [[enable=]<pattern>][,events=<file>][,file=<file>]\n"
+"                            specify tracing options\n"
+#ifdef CONFIG_LIBCAP_NG
+"  -u, --user=USER           user to drop privileges to\n"
+"  -g, --group=GROUP         group to drop privileges to\n"
+#endif
+"\n"
+QEMU_HELP_BOTTOM "\n"
+    , name, pidfile, socket_path);
+}
+
+static void version(const char *name)
+{
+    printf(
+"%s " QEMU_FULL_VERSION "\n"
+"Written by Anthony Harivel.\n"
+"\n"
+QEMU_COPYRIGHT "\n"
+"This is free software; see the source for copying conditions.  There is NO\n"
+"warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.\n"
+    , name);
+}
+
+typedef struct VMSRHelperClient {
+    QIOChannelSocket *ioc;
+    Coroutine *co;
+} VMSRHelperClient;
+
+static void coroutine_fn vh_co_entry(void *opaque)
+{
+    VMSRHelperClient *client = opaque;
+    uint64_t vmsr;
+    uint32_t request[3];
+    unsigned int peer_pid;
+    int r;
+    Error *local_err = NULL;
+
+    qio_channel_set_blocking(QIO_CHANNEL(client->ioc),
+                             false, NULL);
+
+    qio_channel_set_follow_coroutine_ctx(QIO_CHANNEL(client->ioc), true);
+
+    /*
+     * Check peer credentials
+     */
+    qio_channel_get_peerpid(QIO_CHANNEL(client->ioc), &peer_pid, &local_err);
+
+    if (peer_pid == 0) {
+        if (local_err != NULL) {
+            error_report_err(local_err);
+        }
+        error_report("Failed to get peer credentials");
+        goto out;
+    }
+
+    /*
+     * Read the requested MSR
+     * Only RAPL MSR in rapl-msr-index.h is allowed
+     */
+    r = qio_channel_read_all(QIO_CHANNEL(client->ioc),
+                             (char *) &request, sizeof(request), &local_err);
+    if ((local_err != NULL) || r < 0) {
+        error_report("Read request fail");
+        error_report_err(local_err);
+        goto out;
+    }
+    if (!is_msr_allowed(request[0])) {
+        error_report("Requested unallowed msr: %d", request[0]);
+        goto out;
+    }
+
+    vmsr = vmsr_read_msr(request[0], request[1]);
+
+    if (!is_tid_present(peer_pid, request[2])) {
+        error_report("Requested TID not in peer PID: %d %d",
+            peer_pid, request[2]);
+        vmsr = 0;
+    }
+
+    r = qio_channel_write_all(QIO_CHANNEL(client->ioc),
+                         (char *) &vmsr, sizeof(vmsr), &local_err);
+    if ((local_err != NULL) || r < 0) {
+        error_report("Write request fail");
+        error_report_err(local_err);
+        goto out;
+    }
+
+out:
+    object_unref(OBJECT(client->ioc));
+    g_free(client);
+}
+
+static gboolean accept_client(QIOChannel *ioc,
+                              GIOCondition cond,
+                              gpointer opaque)
+{
+    QIOChannelSocket *cioc;
+    VMSRHelperClient *vmsrh;
+
+    cioc = qio_channel_socket_accept(QIO_CHANNEL_SOCKET(ioc),
+                                     NULL);
+    if (!cioc) {
+        return TRUE;
+    }
+
+    vmsrh = g_new(VMSRHelperClient, 1);
+    vmsrh->ioc = cioc;
+    vmsrh->co = qemu_coroutine_create(vh_co_entry, vmsrh);
+    qemu_coroutine_enter(vmsrh->co);
+
+    return TRUE;
+}
+
+static void termsig_handler(int signum)
+{
+    qatomic_cmpxchg(&state, RUNNING, TERMINATE);
+    qemu_notify_event();
+}
+
+static void close_server_socket(void)
+{
+    assert(server_ioc);
+
+    g_source_remove(server_watch);
+    server_watch = -1;
+    object_unref(OBJECT(server_ioc));
+    num_active_sockets--;
+}
+
+#ifdef CONFIG_LIBCAP_NG
+static int drop_privileges(void)
+{
+    /* clear all capabilities */
+    capng_clear(CAPNG_SELECT_BOTH);
+
+    if (capng_update(CAPNG_ADD, CAPNG_EFFECTIVE | CAPNG_PERMITTED,
+                     CAP_SYS_RAWIO) < 0) {
+        return -1;
+    }
+
+    /*
+     * Change user/group id, retaining the capabilities.
+     * Because file descriptors are passed via SCM_RIGHTS,
+     * we don't need supplementary groups (and in fact the helper
+     * can run as "nobody").
+     */
+    if (capng_change_id(uid != -1 ? uid : getuid(),
+                        gid != -1 ? gid : getgid(),
+                        CAPNG_DROP_SUPP_GRP | CAPNG_CLEAR_BOUNDING)) {
+        return -1;
+    }
+
+    return 0;
+}
+#endif
+
+int main(int argc, char **argv)
+{
+    const char *sopt = "hVk:f:dT:u:g:vq";
+    struct option lopt[] = {
+        { "help", no_argument, NULL, 'h' },
+        { "version", no_argument, NULL, 'V' },
+        { "socket", required_argument, NULL, 'k' },
+        { "pidfile", required_argument, NULL, 'f' },
+        { "daemon", no_argument, NULL, 'd' },
+        { "trace", required_argument, NULL, 'T' },
+        { "user", required_argument, NULL, 'u' },
+        { "group", required_argument, NULL, 'g' },
+        { "verbose", no_argument, NULL, 'v' },
+        { NULL, 0, NULL, 0 }
+    };
+    int opt_ind = 0;
+    int ch;
+    Error *local_err = NULL;
+    bool daemonize = false;
+    bool pidfile_specified = false;
+    bool socket_path_specified = false;
+    unsigned socket_activation;
+
+    struct sigaction sa_sigterm;
+    memset(&sa_sigterm, 0, sizeof(sa_sigterm));
+    sa_sigterm.sa_handler = termsig_handler;
+    sigaction(SIGTERM, &sa_sigterm, NULL);
+    sigaction(SIGINT, &sa_sigterm, NULL);
+    sigaction(SIGHUP, &sa_sigterm, NULL);
+
+    signal(SIGPIPE, SIG_IGN);
+
+    error_init(argv[0]);
+    module_call_init(MODULE_INIT_TRACE);
+    module_call_init(MODULE_INIT_QOM);
+    qemu_add_opts(&qemu_trace_opts);
+    qemu_init_exec_dir(argv[0]);
+
+    compute_default_paths();
+
+    while ((ch = getopt_long(argc, argv, sopt, lopt, &opt_ind)) != -1) {
+        switch (ch) {
+        case 'k':
+            g_free(socket_path);
+            socket_path = g_strdup(optarg);
+            socket_path_specified = true;
+            if (socket_path[0] != '/') {
+                error_report("socket path must be absolute");
+                exit(EXIT_FAILURE);
+            }
+            break;
+        case 'f':
+            g_free(pidfile);
+            pidfile = g_strdup(optarg);
+            pidfile_specified = true;
+            break;
+#ifdef CONFIG_LIBCAP_NG
+        case 'u': {
+            unsigned long res;
+            struct passwd *userinfo = getpwnam(optarg);
+            if (userinfo) {
+                uid = userinfo->pw_uid;
+            } else if (qemu_strtoul(optarg, NULL, 10, &res) == 0 &&
+                       (uid_t)res == res) {
+                uid = res;
+            } else {
+                error_report("invalid user '%s'", optarg);
+                exit(EXIT_FAILURE);
+            }
+            break;
+        }
+        case 'g': {
+            unsigned long res;
+            struct group *groupinfo = getgrnam(optarg);
+            if (groupinfo) {
+                gid = groupinfo->gr_gid;
+            } else if (qemu_strtoul(optarg, NULL, 10, &res) == 0 &&
+                       (gid_t)res == res) {
+                gid = res;
+            } else {
+                error_report("invalid group '%s'", optarg);
+                exit(EXIT_FAILURE);
+            }
+            break;
+        }
+#else
+        case 'u':
+        case 'g':
+            error_report("-%c not supported by this %s", ch, argv[0]);
+            exit(1);
+#endif
+        case 'd':
+            daemonize = true;
+            break;
+        case 'T':
+            trace_opt_parse(optarg);
+            break;
+        case 'V':
+            version(argv[0]);
+            exit(EXIT_SUCCESS);
+            break;
+        case 'h':
+            usage(argv[0]);
+            exit(EXIT_SUCCESS);
+            break;
+        case '?':
+            error_report("Try `%s --help' for more information.", argv[0]);
+            exit(EXIT_FAILURE);
+        }
+    }
+
+    if (!trace_init_backends()) {
+        exit(EXIT_FAILURE);
+    }
+    trace_init_file();
+    qemu_set_log(LOG_TRACE, &error_fatal);
+
+    socket_activation = check_socket_activation();
+    if (socket_activation == 0) {
+        SocketAddress saddr;
+        saddr = (SocketAddress){
+            .type = SOCKET_ADDRESS_TYPE_UNIX,
+            .u.q_unix.path = socket_path,
+        };
+        server_ioc = qio_channel_socket_new();
+        if (qio_channel_socket_listen_sync(server_ioc, &saddr,
+                                           1, &local_err) < 0) {
+            object_unref(OBJECT(server_ioc));
+            error_report_err(local_err);
+            return 1;
+        }
+    } else {
+        /* Using socket activation - check user didn't use -p etc. */
+        if (socket_path_specified) {
+            error_report("Unix socket can't be set when \
+                        using socket activation");
+            exit(EXIT_FAILURE);
+        }
+
+        /* Can only listen on a single socket.  */
+        if (socket_activation > 1) {
+            error_report("%s does not support socket activation \
+                        with LISTEN_FDS > 1",
+                        argv[0]);
+            exit(EXIT_FAILURE);
+        }
+        server_ioc = qio_channel_socket_new_fd(FIRST_SOCKET_ACTIVATION_FD,
+                                               &local_err);
+        if (server_ioc == NULL) {
+            error_reportf_err(local_err,
+                              "Failed to use socket activation: ");
+            exit(EXIT_FAILURE);
+        }
+    }
+
+    qemu_init_main_loop(&error_fatal);
+
+    server_watch = qio_channel_add_watch(QIO_CHANNEL(server_ioc),
+                                         G_IO_IN,
+                                         accept_client,
+                                         NULL, NULL);
+
+    if (daemonize) {
+        if (daemon(0, 0) < 0) {
+            error_report("Failed to daemonize: %s", strerror(errno));
+            exit(EXIT_FAILURE);
+        }
+    }
+
+    if (daemonize || pidfile_specified) {
+        qemu_write_pidfile(pidfile, &error_fatal);
+    }
+
+#ifdef CONFIG_LIBCAP_NG
+    if (drop_privileges() < 0) {
+        error_report("Failed to drop privileges: %s", strerror(errno));
+        exit(EXIT_FAILURE);
+    }
+#endif
+
+    state = RUNNING;
+    do {
+        main_loop_wait(false);
+        if (state == TERMINATE) {
+            state = TERMINATING;
+            close_server_socket();
+        }
+    } while (num_active_sockets > 0);
+
+    exit(EXIT_SUCCESS);
+}
diff --git a/tools/i386/rapl-msr-index.h b/tools/i386/rapl-msr-index.h
new file mode 100644
index 000000000000..9a7118639ae3
--- /dev/null
+++ b/tools/i386/rapl-msr-index.h
@@ -0,0 +1,28 @@
+/*
+ * Allowed list of MSR for Privileged RAPL MSR helper commands for QEMU
+ *
+ * Copyright (C) 2023 Red Hat, Inc. <aharivel@redhat.com>
+ *
+ * Author: Anthony Harivel <aharivel@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; under version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * Should stay in sync with the RAPL MSR
+ * in target/i386/cpu.h
+ */
+#define MSR_RAPL_POWER_UNIT             0x00000606
+#define MSR_PKG_POWER_LIMIT             0x00000610
+#define MSR_PKG_ENERGY_STATUS           0x00000611
+#define MSR_PKG_POWER_INFO              0x00000614
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 2/3] tools: build qemu-vmsr-helper
  2024-01-25  7:22 ` [PATCH v3 2/3] tools: build qemu-vmsr-helper Anthony Harivel
@ 2024-01-29 18:53   ` Daniel P. Berrangé
  2024-01-29 19:33     ` Paolo Bonzini
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel P. Berrangé @ 2024-01-29 18:53 UTC (permalink / raw)
  To: Anthony Harivel; +Cc: pbonzini, mtosatti, qemu-devel, vchundur

On Thu, Jan 25, 2024 at 08:22:13AM +0100, Anthony Harivel wrote:
> Introduce a privileged helper to access RAPL MSR.
> 
> The privileged helper tool, qemu-vmsr-helper, is designed to provide
> virtual machines with the ability to read specific RAPL (Running Average
> Power Limit) MSRs without requiring CAP_SYS_RAWIO privileges or relying
> on external, out-of-tree patches.
> 
> The helper tool leverages Unix permissions and SO_PEERCRED socket
> options to enforce access control, ensuring that only processes
> explicitly requesting read access via readmsr() from a valid Thread ID
> can access these MSRs.
> 
> The list of RAPL MSRs that are allowed to be read by the helper tool is
> defined in rapl-msr-index.h. This list corresponds to the RAPL MSRs that
> will be supported in the next commit titled "Add support for RAPL MSRs
> in KVM/QEMU."
> 
> The tool is intentionally designed to run on the Linux x86 platform.
> This initial implementation is tailored for Intel CPUs but can be
> extended to support AMD CPUs in the future.
> 
> Signed-off-by: Anthony Harivel <aharivel@redhat.com>
> ---
>  contrib/systemd/qemu-vmsr-helper.service |  15 +
>  contrib/systemd/qemu-vmsr-helper.socket  |   9 +
>  docs/tools/index.rst                     |   1 +
>  docs/tools/qemu-vmsr-helper.rst          |  89 ++++
>  meson.build                              |   5 +
>  tools/i386/qemu-vmsr-helper.c            | 507 +++++++++++++++++++++++
>  tools/i386/rapl-msr-index.h              |  28 ++
>  7 files changed, 654 insertions(+)
>  create mode 100644 contrib/systemd/qemu-vmsr-helper.service
>  create mode 100644 contrib/systemd/qemu-vmsr-helper.socket
>  create mode 100644 docs/tools/qemu-vmsr-helper.rst
>  create mode 100644 tools/i386/qemu-vmsr-helper.c
>  create mode 100644 tools/i386/rapl-msr-index.h
> 
> diff --git a/contrib/systemd/qemu-vmsr-helper.service b/contrib/systemd/qemu-vmsr-helper.service
> new file mode 100644
> index 000000000000..8fd397bf79a9
> --- /dev/null
> +++ b/contrib/systemd/qemu-vmsr-helper.service
> @@ -0,0 +1,15 @@
> +[Unit]
> +Description=Virtual RAPL MSR Daemon for QEMU
> +
> +[Service]
> +WorkingDirectory=/tmp
> +Type=simple
> +ExecStart=/usr/bin/qemu-vmsr-helper
> +PrivateTmp=yes
> +ProtectSystem=strict
> +ReadWritePaths=/var/run
> +RestrictAddressFamilies=AF_UNIX
> +Restart=always
> +RestartSec=0
> +
> +[Install]
> diff --git a/contrib/systemd/qemu-vmsr-helper.socket b/contrib/systemd/qemu-vmsr-helper.socket
> new file mode 100644
> index 000000000000..183e8304d6e2
> --- /dev/null
> +++ b/contrib/systemd/qemu-vmsr-helper.socket
> @@ -0,0 +1,9 @@
> +[Unit]
> +Description=Virtual RAPL MSR helper for QEMU
> +
> +[Socket]
> +ListenStream=/run/qemu-vmsr-helper.sock
> +SocketMode=0600

This mode means that only root can connect to it, which seems to
defeat the purpose of allowing QEMU to run as non-root surely ?


> diff --git a/meson.build b/meson.build
> index d0329966f1b4..93fc233b0891 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -4015,6 +4015,11 @@ if have_tools
>                 dependencies: [authz, crypto, io, qom, qemuutil,
>                                libcap_ng, mpathpersist],
>                 install: true)
> +
> +    executable('qemu-vmsr-helper', files('tools/i386/qemu-vmsr-helper.c'),

I'd suggest 'tools/x86/' since this works fine on 64-bit too

> +               dependencies: [authz, crypto, io, qom, qemuutil,
> +                              libcap_ng, mpathpersist],
> +               install: true)

Shouldn't this executable() call be conditional though, so this
is only built for x86 host targets.

>    endif
>  
>    if have_ivshmem

> diff --git a/tools/i386/qemu-vmsr-helper.c b/tools/i386/qemu-vmsr-helper.c
> new file mode 100644
> index 000000000000..cf7d09bfcab3
> --- /dev/null
> +++ b/tools/i386/qemu-vmsr-helper.c

> +
> +#define MAX_PATH_LEN 256

Drop this constant as it is redundant

> +#define MSR_PATH_TEMPLATE "/dev/cpu/%u/msr"
> +
> +static char *socket_path;
> +static char *pidfile;
> +static enum { RUNNING, TERMINATE, TERMINATING } state;
> +static QIOChannelSocket *server_ioc;
> +static int server_watch;
> +static int num_active_sockets = 1;
> +
> +#ifdef CONFIG_LIBCAP_NG
> +static int uid = -1;
> +static int gid = -1;
> +#endif
> +
> +static void compute_default_paths(void)
> +{
> +    socket_path = g_build_filename("/run", "qemu-vmsr-helper.sock", NULL);
> +    pidfile = g_build_filename("/run", "qemu-vmsr-helper.pid", NULL);
> +}

We shouldn't be hardcoding /run, we need to honour --prefix and
--localstatedir args given to configure.  /var/run is a symlink
to /run so the end result ends up the same AFAIK

> +
> +/*
> + * Check if the TID that request the MSR read
> + * belongs to the peer. It be should a TID of a vCPU.
> + */
> +static bool is_tid_present(pid_t pid, pid_t tid)
> +{
> +    g_autofree char *pidStr;
> +    g_autofree char *tidStr;
> +
> +    pidStr = g_strdup_printf("%d", pid);
> +    tidStr = g_strdup_printf("%d", tid);
> +
> +    char *tidPath;

This needs to be g_autofree too, and as a rule any variable
declared with 'g_autofree' should always be initialized at
the same time as declaration, so put the g_strdup_printf
calls on the same line.

> +
> +    tidPath = g_strdup_printf("/proc/%s/task/%s", pidStr, tidStr);

> +
> +    /* Check if the TID directory exists within the PID directory */
> +    if (access(tidPath, F_OK) == 0) {
> +        return true;
> +    }
> +    error_report("Failed to open /proc at %s", tidPath);
> +    return false;
> +}
> +
> +/*
> + * Only the RAPL MSR in target/i386/cpu.h are allowed
> + */
> +static bool is_msr_allowed(uint32_t reg)
> +{
> +    switch (reg) {
> +    case MSR_RAPL_POWER_UNIT:
> +    case MSR_PKG_POWER_LIMIT:
> +    case MSR_PKG_ENERGY_STATUS:
> +    case MSR_PKG_POWER_INFO:
> +        return true;
> +    default:
> +        return false;
> +    }
> +}
> +
> +static uint64_t vmsr_read_msr(uint32_t msr_register, unsigned int cpu_id)
> +{
> +    int fd;
> +    uint64_t result = 0;
> +
> +    g_autofree char *path;
> +    path = g_new0(char, MAX_PATH_LEN);

This is a memory leak as it is overwritten on the next line.

> +    path = g_strdup_printf(MSR_PATH_TEMPLATE, cpu_id);

Since this constant is only used in one place, I'd prefer to see it
inline here, so an observer can see what format specifiers are relied
upon

> +
> +    /*
> +     * Check if the specified CPU is included in the thread's affinity
> +     */
> +    cpu_set_t cpu_set;
> +    CPU_ZERO(&cpu_set);
> +    sched_getaffinity(0, sizeof(cpu_set_t), &cpu_set);


Using cpu_set_t will break on machines with >= 1024 CPUs.

There are a numbr of painpoints with calling sched_getaffinity so
rather than try to explain them, let me point you to the man page
for sched_getaffinity underthe headeing:

  "Handling systems with large CPU affinity masks"


> +
> +    if (!CPU_ISSET(cpu_id, &cpu_set)) {
> +        fprintf(stderr, "CPU %u is not in the thread's affinity.\n", cpu_id);

Use error_report() for consistency

> +        return result;
> +    }
> +
> +    fd = open(path, O_RDONLY);
> +    if (fd < 0) {
> +        error_report("Failed to open MSR file at %s", path);
> +        return result;
> +    }
> +
> +    if (pread(fd, &result, sizeof(result), msr_register) != sizeof(result)) {

Leaks 'fd'

> +        error_report("Failed to read MSR");
> +        result = 0;
> +    }
> +
> +    close(fd);
> +    return result;
> +}


> +static void coroutine_fn vh_co_entry(void *opaque)
> +{
> +    VMSRHelperClient *client = opaque;
> +    uint64_t vmsr;
> +    uint32_t request[3];
> +    unsigned int peer_pid;
> +    int r;
> +    Error *local_err = NULL;
> +
> +    qio_channel_set_blocking(QIO_CHANNEL(client->ioc),
> +                             false, NULL);
> +
> +    qio_channel_set_follow_coroutine_ctx(QIO_CHANNEL(client->ioc), true);
> +
> +    /*
> +     * Check peer credentials
> +     */
> +    qio_channel_get_peerpid(QIO_CHANNEL(client->ioc), &peer_pid, &local_err);
> +
> +    if (peer_pid == 0) {
> +        if (local_err != NULL) {
> +            error_report_err(local_err);
> +        }
> +        error_report("Failed to get peer credentials");

Using peer_pid == 0 as a substitute for checking if 'local_err' is set
is bogus. Check for 'local_err != NULL'.

> +        goto out;
> +    }
> +
> +    /*
> +     * Read the requested MSR
> +     * Only RAPL MSR in rapl-msr-index.h is allowed
> +     */
> +    r = qio_channel_read_all(QIO_CHANNEL(client->ioc),
> +                             (char *) &request, sizeof(request), &local_err);
> +    if ((local_err != NULL) || r < 0) {

Checking both conditions is redundant, pick one

> +        error_report("Read request fail");
> +        error_report_err(local_err);
> +        goto out;
> +    }
> +    if (!is_msr_allowed(request[0])) {
> +        error_report("Requested unallowed msr: %d", request[0]);
> +        goto out;
> +    }
> +
> +    vmsr = vmsr_read_msr(request[0], request[1]);
> +
> +    if (!is_tid_present(peer_pid, request[2])) {
> +        error_report("Requested TID not in peer PID: %d %d",
> +            peer_pid, request[2]);

Indent is a little off.

You never answered my question from the previous posting of this

This check is merely validating the the thread ID in the message
is a child of the process ID connected to the socket. Any process
on the entire host can satisfy this requirement.

I don't see what is limiting this to only QEMU as claimed by the
commit message, unless you're expecting the UNIX socket permissions
to be such that only processes under the qemu:qemu user:group pair
can access to the socket ? That would be a libvirt based permissions
assumption though.

> +        vmsr = 0;
> +    }
> +
> +    r = qio_channel_write_all(QIO_CHANNEL(client->ioc),
> +                         (char *) &vmsr, sizeof(vmsr), &local_err);

Indent got a little off here too.

> +    if ((local_err != NULL) || r < 0) {

Again redundant checks

> +        error_report("Write request fail");
> +        error_report_err(local_err);
> +        goto out;
> +    }
> +
> +out:
> +    object_unref(OBJECT(client->ioc));
> +    g_free(client);
> +}
> +
> +static gboolean accept_client(QIOChannel *ioc,
> +                              GIOCondition cond,
> +                              gpointer opaque)
> +{
> +    QIOChannelSocket *cioc;
> +    VMSRHelperClient *vmsrh;
> +
> +    cioc = qio_channel_socket_accept(QIO_CHANNEL_SOCKET(ioc),
> +                                     NULL);
> +    if (!cioc) {
> +        return TRUE;
> +    }
> +
> +    vmsrh = g_new(VMSRHelperClient, 1);
> +    vmsrh->ioc = cioc;
> +    vmsrh->co = qemu_coroutine_create(vh_co_entry, vmsrh);
> +    qemu_coroutine_enter(vmsrh->co);
> +
> +    return TRUE;
> +}
> +
> +static void termsig_handler(int signum)
> +{
> +    qatomic_cmpxchg(&state, RUNNING, TERMINATE);
> +    qemu_notify_event();
> +}
> +
> +static void close_server_socket(void)
> +{
> +    assert(server_ioc);
> +
> +    g_source_remove(server_watch);
> +    server_watch = -1;
> +    object_unref(OBJECT(server_ioc));
> +    num_active_sockets--;
> +}
> +
> +#ifdef CONFIG_LIBCAP_NG
> +static int drop_privileges(void)
> +{
> +    /* clear all capabilities */
> +    capng_clear(CAPNG_SELECT_BOTH);
> +
> +    if (capng_update(CAPNG_ADD, CAPNG_EFFECTIVE | CAPNG_PERMITTED,
> +                     CAP_SYS_RAWIO) < 0) {
> +        return -1;
> +    }
> +
> +    /*
> +     * Change user/group id, retaining the capabilities.
> +     * Because file descriptors are passed via SCM_RIGHTS,
> +     * we don't need supplementary groups (and in fact the helper
> +     * can run as "nobody").
> +     */
> +    if (capng_change_id(uid != -1 ? uid : getuid(),
> +                        gid != -1 ? gid : getgid(),
> +                        CAPNG_DROP_SUPP_GRP | CAPNG_CLEAR_BOUNDING)) {
> +        return -1;
> +    }

Does this actually work ?  IIUC, the file that requires privileges
is /dev/cpu/%u/msr, and we're opening that fresh on every request,
so how can this run as anything other than root ?

> +
> +    return 0;
> +}
> +#endif
> +
> +int main(int argc, char **argv)
> +{
> +    const char *sopt = "hVk:f:dT:u:g:vq";
> +    struct option lopt[] = {
> +        { "help", no_argument, NULL, 'h' },
> +        { "version", no_argument, NULL, 'V' },
> +        { "socket", required_argument, NULL, 'k' },
> +        { "pidfile", required_argument, NULL, 'f' },
> +        { "daemon", no_argument, NULL, 'd' },
> +        { "trace", required_argument, NULL, 'T' },
> +        { "user", required_argument, NULL, 'u' },
> +        { "group", required_argument, NULL, 'g' },
> +        { "verbose", no_argument, NULL, 'v' },
> +        { NULL, 0, NULL, 0 }
> +    };
> +    int opt_ind = 0;
> +    int ch;
> +    Error *local_err = NULL;
> +    bool daemonize = false;
> +    bool pidfile_specified = false;
> +    bool socket_path_specified = false;
> +    unsigned socket_activation;
> +
> +    struct sigaction sa_sigterm;
> +    memset(&sa_sigterm, 0, sizeof(sa_sigterm));
> +    sa_sigterm.sa_handler = termsig_handler;
> +    sigaction(SIGTERM, &sa_sigterm, NULL);
> +    sigaction(SIGINT, &sa_sigterm, NULL);
> +    sigaction(SIGHUP, &sa_sigterm, NULL);
> +
> +    signal(SIGPIPE, SIG_IGN);
> +
> +    error_init(argv[0]);
> +    module_call_init(MODULE_INIT_TRACE);
> +    module_call_init(MODULE_INIT_QOM);
> +    qemu_add_opts(&qemu_trace_opts);
> +    qemu_init_exec_dir(argv[0]);
> +
> +    compute_default_paths();
> +
> +    while ((ch = getopt_long(argc, argv, sopt, lopt, &opt_ind)) != -1) {
> +        switch (ch) {
> +        case 'k':
> +            g_free(socket_path);
> +            socket_path = g_strdup(optarg);
> +            socket_path_specified = true;
> +            if (socket_path[0] != '/') {
> +                error_report("socket path must be absolute");
> +                exit(EXIT_FAILURE);
> +            }
> +            break;
> +        case 'f':
> +            g_free(pidfile);
> +            pidfile = g_strdup(optarg);
> +            pidfile_specified = true;
> +            break;
> +#ifdef CONFIG_LIBCAP_NG
> +        case 'u': {
> +            unsigned long res;
> +            struct passwd *userinfo = getpwnam(optarg);
> +            if (userinfo) {
> +                uid = userinfo->pw_uid;
> +            } else if (qemu_strtoul(optarg, NULL, 10, &res) == 0 &&
> +                       (uid_t)res == res) {
> +                uid = res;
> +            } else {
> +                error_report("invalid user '%s'", optarg);
> +                exit(EXIT_FAILURE);
> +            }
> +            break;
> +        }
> +        case 'g': {
> +            unsigned long res;
> +            struct group *groupinfo = getgrnam(optarg);
> +            if (groupinfo) {
> +                gid = groupinfo->gr_gid;
> +            } else if (qemu_strtoul(optarg, NULL, 10, &res) == 0 &&
> +                       (gid_t)res == res) {
> +                gid = res;
> +            } else {
> +                error_report("invalid group '%s'", optarg);
> +                exit(EXIT_FAILURE);
> +            }
> +            break;
> +        }
> +#else
> +        case 'u':
> +        case 'g':
> +            error_report("-%c not supported by this %s", ch, argv[0]);
> +            exit(1);
> +#endif
> +        case 'd':
> +            daemonize = true;
> +            break;
> +        case 'T':
> +            trace_opt_parse(optarg);
> +            break;
> +        case 'V':
> +            version(argv[0]);
> +            exit(EXIT_SUCCESS);
> +            break;
> +        case 'h':
> +            usage(argv[0]);
> +            exit(EXIT_SUCCESS);
> +            break;
> +        case '?':
> +            error_report("Try `%s --help' for more information.", argv[0]);
> +            exit(EXIT_FAILURE);
> +        }
> +    }
> +
> +    if (!trace_init_backends()) {
> +        exit(EXIT_FAILURE);
> +    }
> +    trace_init_file();
> +    qemu_set_log(LOG_TRACE, &error_fatal);
> +
> +    socket_activation = check_socket_activation();
> +    if (socket_activation == 0) {
> +        SocketAddress saddr;
> +        saddr = (SocketAddress){
> +            .type = SOCKET_ADDRESS_TYPE_UNIX,
> +            .u.q_unix.path = socket_path,
> +        };
> +        server_ioc = qio_channel_socket_new();
> +        if (qio_channel_socket_listen_sync(server_ioc, &saddr,
> +                                           1, &local_err) < 0) {
> +            object_unref(OBJECT(server_ioc));
> +            error_report_err(local_err);
> +            return 1;
> +        }
> +    } else {
> +        /* Using socket activation - check user didn't use -p etc. */
> +        if (socket_path_specified) {
> +            error_report("Unix socket can't be set when \
> +                        using socket activation");
> +            exit(EXIT_FAILURE);
> +        }
> +
> +        /* Can only listen on a single socket.  */
> +        if (socket_activation > 1) {
> +            error_report("%s does not support socket activation \
> +                        with LISTEN_FDS > 1",
> +                        argv[0]);
> +            exit(EXIT_FAILURE);
> +        }
> +        server_ioc = qio_channel_socket_new_fd(FIRST_SOCKET_ACTIVATION_FD,
> +                                               &local_err);
> +        if (server_ioc == NULL) {
> +            error_reportf_err(local_err,
> +                              "Failed to use socket activation: ");
> +            exit(EXIT_FAILURE);
> +        }
> +    }
> +
> +    qemu_init_main_loop(&error_fatal);
> +
> +    server_watch = qio_channel_add_watch(QIO_CHANNEL(server_ioc),
> +                                         G_IO_IN,
> +                                         accept_client,
> +                                         NULL, NULL);
> +
> +    if (daemonize) {
> +        if (daemon(0, 0) < 0) {
> +            error_report("Failed to daemonize: %s", strerror(errno));
> +            exit(EXIT_FAILURE);
> +        }
> +    }
> +
> +    if (daemonize || pidfile_specified) {
> +        qemu_write_pidfile(pidfile, &error_fatal);
> +    }
> +
> +#ifdef CONFIG_LIBCAP_NG
> +    if (drop_privileges() < 0) {
> +        error_report("Failed to drop privileges: %s", strerror(errno));
> +        exit(EXIT_FAILURE);
> +    }
> +#endif
> +
> +    state = RUNNING;
> +    do {
> +        main_loop_wait(false);
> +        if (state == TERMINATE) {
> +            state = TERMINATING;
> +            close_server_socket();
> +        }
> +    } while (num_active_sockets > 0);
> +
> +    exit(EXIT_SUCCESS);
> +}
> diff --git a/tools/i386/rapl-msr-index.h b/tools/i386/rapl-msr-index.h
> new file mode 100644
> index 000000000000..9a7118639ae3
> --- /dev/null
> +++ b/tools/i386/rapl-msr-index.h
> @@ -0,0 +1,28 @@
> +/*
> + * Allowed list of MSR for Privileged RAPL MSR helper commands for QEMU
> + *
> + * Copyright (C) 2023 Red Hat, Inc. <aharivel@redhat.com>
> + *
> + * Author: Anthony Harivel <aharivel@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; under version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +/*
> + * Should stay in sync with the RAPL MSR
> + * in target/i386/cpu.h
> + */
> +#define MSR_RAPL_POWER_UNIT             0x00000606
> +#define MSR_PKG_POWER_LIMIT             0x00000610
> +#define MSR_PKG_ENERGY_STATUS           0x00000611
> +#define MSR_PKG_POWER_INFO              0x00000614
> -- 
> 2.43.0
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 2/3] tools: build qemu-vmsr-helper
  2024-01-29 18:53   ` Daniel P. Berrangé
@ 2024-01-29 19:33     ` Paolo Bonzini
  2024-01-29 19:45       ` Daniel P. Berrangé
  2024-03-01 11:08       ` Anthony Harivel
  0 siblings, 2 replies; 32+ messages in thread
From: Paolo Bonzini @ 2024-01-29 19:33 UTC (permalink / raw)
  To: Daniel P. Berrangé; +Cc: Anthony Harivel, mtosatti, qemu-devel, vchundur

On Mon, Jan 29, 2024 at 7:53 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > diff --git a/meson.build b/meson.build
> > index d0329966f1b4..93fc233b0891 100644
> > --- a/meson.build
> > +++ b/meson.build
> > @@ -4015,6 +4015,11 @@ if have_tools
> >                 dependencies: [authz, crypto, io, qom, qemuutil,
> >                                libcap_ng, mpathpersist],
> >                 install: true)
> > +
> > +    executable('qemu-vmsr-helper', files('tools/i386/qemu-vmsr-helper.c'),
>
> I'd suggest 'tools/x86/' since this works fine on 64-bit too

QEMU tends to use i386 in the source to mean both 32- and 64-bit.
Either is fine by me though.

> > +               dependencies: [authz, crypto, io, qom, qemuutil,
> > +                              libcap_ng, mpathpersist],
> > +               install: true)
>
> Shouldn't this executable() call be conditional though, so this
> is only built for x86 host targets.

Yes. Also should be 32- and 64-bit (careful because Meson uses 'x86' for
32-bit).


> > +static void compute_default_paths(void)
> > +{
> > +    socket_path = g_build_filename("/run", "qemu-vmsr-helper.sock", NULL);
> > +    pidfile = g_build_filename("/run", "qemu-vmsr-helper.pid", NULL);
> > +}
>
> We shouldn't be hardcoding /run, we need to honour --prefix and
> --localstatedir args given to configure.  /var/run is a symlink
> to /run so the end result ends up the same AFAIK

Indeed; just copy from scsi/qemu-pr-helper.c.

> You never answered my question from the previous posting of this
>
> This check is merely validating the the thread ID in the message
> is a child of the process ID connected to the socket. Any process
> on the entire host can satisfy this requirement.
>
> I don't see what is limiting this to only QEMU as claimed by the
> commit message, unless you're expecting the UNIX socket permissions
> to be such that only processes under the qemu:qemu user:group pair
> can access to the socket ? That would be a libvirt based permissions
> assumption though.

Yes, this is why the systemd socket uses 600, like
contrib/systemd/qemu-pr-helper.socket. The socket can be passed via
SCM_RIGHTS by libvirt, or its permissions can be changed (e.g. 660 and
root:kvm would make sense on a Debian system), or a separate helper
can be started by libvirt.

Either way, the policy is left to the user rather than embedding it in
the provided systemd unit.

> > +    if (capng_update(CAPNG_ADD, CAPNG_EFFECTIVE | CAPNG_PERMITTED,
> > +                     CAP_SYS_RAWIO) < 0) {
> > +        return -1;
> > +    }
> > +
> > +    /*
> > +     * Change user/group id, retaining the capabilities.
> > +     * Because file descriptors are passed via SCM_RIGHTS,
> > +     * we don't need supplementary groups (and in fact the helper
> > +     * can run as "nobody").
> > +     */
> > +    if (capng_change_id(uid != -1 ? uid : getuid(),
> > +                        gid != -1 ? gid : getgid(),
> > +                        CAPNG_DROP_SUPP_GRP | CAPNG_CLEAR_BOUNDING)) {
> > +        return -1;
> > +    }
>
> Does this actually work ?  IIUC, the file that requires privileges
> is /dev/cpu/%u/msr, and we're opening that fresh on every request,
> so how can this run as anything other than root ?

Agreed, the capabilities can be dropped but the uid and gid cannot.

Paolo



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 2/3] tools: build qemu-vmsr-helper
  2024-01-29 19:33     ` Paolo Bonzini
@ 2024-01-29 19:45       ` Daniel P. Berrangé
  2024-01-29 19:53         ` Daniel P. Berrangé
  2024-02-21 13:19         ` Anthony Harivel
  2024-03-01 11:08       ` Anthony Harivel
  1 sibling, 2 replies; 32+ messages in thread
From: Daniel P. Berrangé @ 2024-01-29 19:45 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Anthony Harivel, mtosatti, qemu-devel, vchundur

On Mon, Jan 29, 2024 at 08:33:21PM +0100, Paolo Bonzini wrote:
> On Mon, Jan 29, 2024 at 7:53 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > diff --git a/meson.build b/meson.build
> > > index d0329966f1b4..93fc233b0891 100644
> > > --- a/meson.build
> > > +++ b/meson.build
> > > @@ -4015,6 +4015,11 @@ if have_tools
> > >                 dependencies: [authz, crypto, io, qom, qemuutil,
> > >                                libcap_ng, mpathpersist],
> > >                 install: true)
> > > +
> > > +    executable('qemu-vmsr-helper', files('tools/i386/qemu-vmsr-helper.c'),
> >
> > I'd suggest 'tools/x86/' since this works fine on 64-bit too
> 
> QEMU tends to use i386 in the source to mean both 32- and 64-bit.

One day we should rename that to x86 too :-)

> > You never answered my question from the previous posting of this
> >
> > This check is merely validating the the thread ID in the message
> > is a child of the process ID connected to the socket. Any process
> > on the entire host can satisfy this requirement.
> >
> > I don't see what is limiting this to only QEMU as claimed by the
> > commit message, unless you're expecting the UNIX socket permissions
> > to be such that only processes under the qemu:qemu user:group pair
> > can access to the socket ? That would be a libvirt based permissions
> > assumption though.
> 
> Yes, this is why the systemd socket uses 600, like
> contrib/systemd/qemu-pr-helper.socket. The socket can be passed via
> SCM_RIGHTS by libvirt, or its permissions can be changed (e.g. 660 and
> root:kvm would make sense on a Debian system), or a separate helper
> can be started by libvirt.
> 
> Either way, the policy is left to the user rather than embedding it in
> the provided systemd unit.

Ok, this code needs a comment to explain that we're relying on
socket permissions to control who/what can access the daemon,
combined with this PID+TID check to validate it is not spoofing
its identity, as without context the TID check looks pointless.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 2/3] tools: build qemu-vmsr-helper
  2024-01-29 19:45       ` Daniel P. Berrangé
@ 2024-01-29 19:53         ` Daniel P. Berrangé
  2024-01-29 20:21           ` Paolo Bonzini
  2024-02-21 13:19         ` Anthony Harivel
  1 sibling, 1 reply; 32+ messages in thread
From: Daniel P. Berrangé @ 2024-01-29 19:53 UTC (permalink / raw)
  To: Paolo Bonzini, Anthony Harivel, mtosatti, qemu-devel, vchundur

On Mon, Jan 29, 2024 at 07:45:51PM +0000, Daniel P. Berrangé wrote:
> On Mon, Jan 29, 2024 at 08:33:21PM +0100, Paolo Bonzini wrote:
> > On Mon, Jan 29, 2024 at 7:53 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > diff --git a/meson.build b/meson.build
> > > > index d0329966f1b4..93fc233b0891 100644
> > > > --- a/meson.build
> > > > +++ b/meson.build
> > > > @@ -4015,6 +4015,11 @@ if have_tools
> > > >                 dependencies: [authz, crypto, io, qom, qemuutil,
> > > >                                libcap_ng, mpathpersist],
> > > >                 install: true)
> > > > +
> > > > +    executable('qemu-vmsr-helper', files('tools/i386/qemu-vmsr-helper.c'),
> > >
> > > I'd suggest 'tools/x86/' since this works fine on 64-bit too
> > 
> > QEMU tends to use i386 in the source to mean both 32- and 64-bit.
> 
> One day we should rename that to x86 too :-)
> 
> > > You never answered my question from the previous posting of this
> > >
> > > This check is merely validating the the thread ID in the message
> > > is a child of the process ID connected to the socket. Any process
> > > on the entire host can satisfy this requirement.
> > >
> > > I don't see what is limiting this to only QEMU as claimed by the
> > > commit message, unless you're expecting the UNIX socket permissions
> > > to be such that only processes under the qemu:qemu user:group pair
> > > can access to the socket ? That would be a libvirt based permissions
> > > assumption though.
> > 
> > Yes, this is why the systemd socket uses 600, like
> > contrib/systemd/qemu-pr-helper.socket. The socket can be passed via
> > SCM_RIGHTS by libvirt, or its permissions can be changed (e.g. 660 and
> > root:kvm would make sense on a Debian system), or a separate helper
> > can be started by libvirt.
> > 
> > Either way, the policy is left to the user rather than embedding it in
> > the provided systemd unit.
> 
> Ok, this code needs a comment to explain that we're relying on
> socket permissions to control who/what can access the daemon,
> combined with this PID+TID check to validate it is not spoofing
> its identity, as without context the TID check looks pointless.

Looking again, the TID is never used after being checked. QEMU sends
the TID, but the helper never does anything with this information
except to check the TID belongs the PID.  Why are we sending the TID ?

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 2/3] tools: build qemu-vmsr-helper
  2024-01-29 19:53         ` Daniel P. Berrangé
@ 2024-01-29 20:21           ` Paolo Bonzini
  0 siblings, 0 replies; 32+ messages in thread
From: Paolo Bonzini @ 2024-01-29 20:21 UTC (permalink / raw)
  To: Daniel P. Berrangé; +Cc: Anthony Harivel, mtosatti, qemu-devel, vchundur

On Mon, Jan 29, 2024 at 8:54 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> Looking again, the TID is never used after being checked. QEMU sends
> the TID, but the helper never does anything with this information
> except to check the TID belongs the PID.  Why are we sending the TID ?

Doh! sched_getaffinity's first argument should be the tid. (As an
aside, sched_getaffinity does not need CAP_SYS_NICE to read the
affinity of another user's thread).

Paolo



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 2/3] tools: build qemu-vmsr-helper
  2024-01-29 19:45       ` Daniel P. Berrangé
  2024-01-29 19:53         ` Daniel P. Berrangé
@ 2024-02-21 13:19         ` Anthony Harivel
  2024-02-21 13:47           ` Daniel P. Berrangé
  1 sibling, 1 reply; 32+ messages in thread
From: Anthony Harivel @ 2024-02-21 13:19 UTC (permalink / raw)
  To: Daniel P. Berrangé, Paolo Bonzini; +Cc: mtosatti, qemu-devel, vchundur

Daniel P. Berrangé, Jan 29, 2024 at 20:45:
> On Mon, Jan 29, 2024 at 08:33:21PM +0100, Paolo Bonzini wrote:
> > On Mon, Jan 29, 2024 at 7:53 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > diff --git a/meson.build b/meson.build
> > > > index d0329966f1b4..93fc233b0891 100644
> > > > --- a/meson.build
> > > > +++ b/meson.build
> > > > @@ -4015,6 +4015,11 @@ if have_tools
> > > >                 dependencies: [authz, crypto, io, qom, qemuutil,
> > > >                                libcap_ng, mpathpersist],
> > > >                 install: true)
> > > > +
> > > > +    executable('qemu-vmsr-helper', files('tools/i386/qemu-vmsr-helper.c'),
> > >
> > > I'd suggest 'tools/x86/' since this works fine on 64-bit too
> > 
> > QEMU tends to use i386 in the source to mean both 32- and 64-bit.
>
> One day we should rename that to x86 too :-)
>
> > > You never answered my question from the previous posting of this
> > >
> > > This check is merely validating the the thread ID in the message
> > > is a child of the process ID connected to the socket. Any process
> > > on the entire host can satisfy this requirement.
> > >
> > > I don't see what is limiting this to only QEMU as claimed by the
> > > commit message, unless you're expecting the UNIX socket permissions
> > > to be such that only processes under the qemu:qemu user:group pair
> > > can access to the socket ? That would be a libvirt based permissions
> > > assumption though.
> > 
> > Yes, this is why the systemd socket uses 600, like
> > contrib/systemd/qemu-pr-helper.socket. The socket can be passed via
> > SCM_RIGHTS by libvirt, or its permissions can be changed (e.g. 660 and
> > root:kvm would make sense on a Debian system), or a separate helper
> > can be started by libvirt.
> > 
> > Either way, the policy is left to the user rather than embedding it in
> > the provided systemd unit.
>
> Ok, this code needs a comment to explain that we're relying on
> socket permissions to control who/what can access the daemon,
> combined with this PID+TID check to validate it is not spoofing
> its identity, as without context the TID check looks pointless.

Hi Daniel,

would you prefer a comment in the code or a security section in the doc 
(i.e docs/specs/rapl-msr.rst) ?

Regards,
Anthony

>
>
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 2/3] tools: build qemu-vmsr-helper
  2024-02-21 13:19         ` Anthony Harivel
@ 2024-02-21 13:47           ` Daniel P. Berrangé
  2024-02-21 13:52             ` Anthony Harivel
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel P. Berrangé @ 2024-02-21 13:47 UTC (permalink / raw)
  To: Anthony Harivel; +Cc: Paolo Bonzini, mtosatti, qemu-devel, vchundur

On Wed, Feb 21, 2024 at 02:19:11PM +0100, Anthony Harivel wrote:
> Daniel P. Berrangé, Jan 29, 2024 at 20:45:
> > On Mon, Jan 29, 2024 at 08:33:21PM +0100, Paolo Bonzini wrote:
> > > On Mon, Jan 29, 2024 at 7:53 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > > diff --git a/meson.build b/meson.build
> > > > > index d0329966f1b4..93fc233b0891 100644
> > > > > --- a/meson.build
> > > > > +++ b/meson.build
> > > > > @@ -4015,6 +4015,11 @@ if have_tools
> > > > >                 dependencies: [authz, crypto, io, qom, qemuutil,
> > > > >                                libcap_ng, mpathpersist],
> > > > >                 install: true)
> > > > > +
> > > > > +    executable('qemu-vmsr-helper', files('tools/i386/qemu-vmsr-helper.c'),
> > > >
> > > > I'd suggest 'tools/x86/' since this works fine on 64-bit too
> > > 
> > > QEMU tends to use i386 in the source to mean both 32- and 64-bit.
> >
> > One day we should rename that to x86 too :-)
> >
> > > > You never answered my question from the previous posting of this
> > > >
> > > > This check is merely validating the the thread ID in the message
> > > > is a child of the process ID connected to the socket. Any process
> > > > on the entire host can satisfy this requirement.
> > > >
> > > > I don't see what is limiting this to only QEMU as claimed by the
> > > > commit message, unless you're expecting the UNIX socket permissions
> > > > to be such that only processes under the qemu:qemu user:group pair
> > > > can access to the socket ? That would be a libvirt based permissions
> > > > assumption though.
> > > 
> > > Yes, this is why the systemd socket uses 600, like
> > > contrib/systemd/qemu-pr-helper.socket. The socket can be passed via
> > > SCM_RIGHTS by libvirt, or its permissions can be changed (e.g. 660 and
> > > root:kvm would make sense on a Debian system), or a separate helper
> > > can be started by libvirt.
> > > 
> > > Either way, the policy is left to the user rather than embedding it in
> > > the provided systemd unit.
> >
> > Ok, this code needs a comment to explain that we're relying on
> > socket permissions to control who/what can access the daemon,
> > combined with this PID+TID check to validate it is not spoofing
> > its identity, as without context the TID check looks pointless.
> 
> Hi Daniel,
> 
> would you prefer a comment in the code or a security section in the doc 
> (i.e docs/specs/rapl-msr.rst) ?

I think it is worth creating a docs/specs/rapl-msr.rst to explain the
overall design & usage & security considerations.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 2/3] tools: build qemu-vmsr-helper
  2024-02-21 13:47           ` Daniel P. Berrangé
@ 2024-02-21 13:52             ` Anthony Harivel
  0 siblings, 0 replies; 32+ messages in thread
From: Anthony Harivel @ 2024-02-21 13:52 UTC (permalink / raw)
  To: Daniel P. Berrangé; +Cc: Paolo Bonzini, mtosatti, qemu-devel, vchundur

Daniel P. Berrangé, Feb 21, 2024 at 14:47:
> On Wed, Feb 21, 2024 at 02:19:11PM +0100, Anthony Harivel wrote:
> > Daniel P. Berrangé, Jan 29, 2024 at 20:45:
> > > On Mon, Jan 29, 2024 at 08:33:21PM +0100, Paolo Bonzini wrote:
> > > > On Mon, Jan 29, 2024 at 7:53 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > > > diff --git a/meson.build b/meson.build
> > > > > > index d0329966f1b4..93fc233b0891 100644
> > > > > > --- a/meson.build
> > > > > > +++ b/meson.build
> > > > > > @@ -4015,6 +4015,11 @@ if have_tools
> > > > > >                 dependencies: [authz, crypto, io, qom, qemuutil,
> > > > > >                                libcap_ng, mpathpersist],
> > > > > >                 install: true)
> > > > > > +
> > > > > > +    executable('qemu-vmsr-helper', files('tools/i386/qemu-vmsr-helper.c'),
> > > > >
> > > > > I'd suggest 'tools/x86/' since this works fine on 64-bit too
> > > > 
> > > > QEMU tends to use i386 in the source to mean both 32- and 64-bit.
> > >
> > > One day we should rename that to x86 too :-)
> > >
> > > > > You never answered my question from the previous posting of this
> > > > >
> > > > > This check is merely validating the the thread ID in the message
> > > > > is a child of the process ID connected to the socket. Any process
> > > > > on the entire host can satisfy this requirement.
> > > > >
> > > > > I don't see what is limiting this to only QEMU as claimed by the
> > > > > commit message, unless you're expecting the UNIX socket permissions
> > > > > to be such that only processes under the qemu:qemu user:group pair
> > > > > can access to the socket ? That would be a libvirt based permissions
> > > > > assumption though.
> > > > 
> > > > Yes, this is why the systemd socket uses 600, like
> > > > contrib/systemd/qemu-pr-helper.socket. The socket can be passed via
> > > > SCM_RIGHTS by libvirt, or its permissions can be changed (e.g. 660 and
> > > > root:kvm would make sense on a Debian system), or a separate helper
> > > > can be started by libvirt.
> > > > 
> > > > Either way, the policy is left to the user rather than embedding it in
> > > > the provided systemd unit.
> > >
> > > Ok, this code needs a comment to explain that we're relying on
> > > socket permissions to control who/what can access the daemon,
> > > combined with this PID+TID check to validate it is not spoofing
> > > its identity, as without context the TID check looks pointless.
> > 
> > Hi Daniel,
> > 
> > would you prefer a comment in the code or a security section in the doc 
> > (i.e docs/specs/rapl-msr.rst) ?
>
> I think it is worth creating a docs/specs/rapl-msr.rst to explain the
> overall design & usage & security considerations.

It was already included in the add-support-for-RAPL-MSRs-in-KVM-Qemu.patch 
but indeed it needs now some updates for the v4 about security and 
change in design.

Regards,
Anthony

>
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 2/3] tools: build qemu-vmsr-helper
  2024-01-29 19:33     ` Paolo Bonzini
  2024-01-29 19:45       ` Daniel P. Berrangé
@ 2024-03-01 11:08       ` Anthony Harivel
  1 sibling, 0 replies; 32+ messages in thread
From: Anthony Harivel @ 2024-03-01 11:08 UTC (permalink / raw)
  To: Paolo Bonzini, Daniel P. Berrangé; +Cc: mtosatti, qemu-devel, vchundur

Hi Paolo,

> > > +static void compute_default_paths(void)
> > > +{
> > > +    socket_path = g_build_filename("/run", "qemu-vmsr-helper.sock", NULL);
> > > +    pidfile = g_build_filename("/run", "qemu-vmsr-helper.pid", NULL);
> > > +}
> >
> > We shouldn't be hardcoding /run, we need to honour --prefix and
> > --localstatedir args given to configure.  /var/run is a symlink
> > to /run so the end result ends up the same AFAIK
>
> Indeed; just copy from scsi/qemu-pr-helper.c.
>

When I copy the same as compute-default() of scsi/qemu-pr-helper.c, the 
helper is listening to "/var/local/run/qemu-vmsr-helper.sock".

Problem is /var/local/run is 700 while /run is 755.

So I would need to adjust the qemu-vmsr-helper.service to give 
the --socket=PATH of qemu-vmsr-helper.socket (i.e 
 /run/qemu-vmsr-helper.sock)

Problem is when I do that , I fall into the case: 
"Unix socket can't be set when using socket activation"

So I'm a bit confused what to do about that.

> > You never answered my question from the previous posting of this
> >
> > This check is merely validating the the thread ID in the message
> > is a child of the process ID connected to the socket. Any process
> > on the entire host can satisfy this requirement.
> >
> > I don't see what is limiting this to only QEMU as claimed by the
> > commit message, unless you're expecting the UNIX socket permissions
> > to be such that only processes under the qemu:qemu user:group pair
> > can access to the socket ? That would be a libvirt based permissions
> > assumption though.
>
> Yes, this is why the systemd socket uses 600, like
> contrib/systemd/qemu-pr-helper.socket. The socket can be passed via
> SCM_RIGHTS by libvirt, or its permissions can be changed (e.g. 660 and
> root:kvm would make sense on a Debian system), or a separate helper
> can be started by libvirt.
>
> Either way, the policy is left to the user rather than embedding it in
> the provided systemd unit.
>

During this patch test, when I run by hand my VM (without libvirt),
the vmsr helper systemd service/socket was like that: 
[Service]
ExecStart=/usr/bin/qemu-vmsr-helper
User=root
Group=root

and 

[Socket]
ListenStream=/run/qemu-vmsr-helper.sock
SocketUser=qemu
SocketGroup=qemu
SocketMode=0660

And it seems to works. So I'm not sure 100% what to do in my patch.

Should I follow the pr-helper systemd files anyway ?

Regards,
Anthony



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu
  2024-01-25  7:22 [PATCH v3 0/3] Add support for the RAPL MSRs series Anthony Harivel
  2024-01-25  7:22 ` [PATCH v3 1/3] qio: add support for SO_PEERCRED for socket channel Anthony Harivel
  2024-01-25  7:22 ` [PATCH v3 2/3] tools: build qemu-vmsr-helper Anthony Harivel
@ 2024-01-25  7:22 ` Anthony Harivel
  2024-01-29 19:29   ` Daniel P. Berrangé
                     ` (2 more replies)
  2 siblings, 3 replies; 32+ messages in thread
From: Anthony Harivel @ 2024-01-25  7:22 UTC (permalink / raw)
  To: pbonzini, mtosatti, berrange; +Cc: qemu-devel, vchundur, Anthony Harivel

Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL
interface (Running Average Power Limit) for advertising the accumulated
energy consumption of various power domains (e.g. CPU packages, DRAM,
etc.).

The consumption is reported via MSRs (model specific registers) like
MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are
64 bits registers that represent the accumulated energy consumption in
micro Joules. They are updated by microcode every ~1ms.

For now, KVM always returns 0 when the guest requests the value of
these MSRs. Use the KVM MSR filtering mechanism to allow QEMU handle
these MSRs dynamically in userspace.

To limit the amount of system calls for every MSR call, create a new
thread in QEMU that updates the "virtual" MSR values asynchronously.

Each vCPU has its own vMSR to reflect the independence of vCPUs. The
thread updates the vMSR values with the ratio of energy consumed of
the whole physical CPU package the vCPU thread runs on and the
thread's utime and stime values.

All other non-vCPU threads are also taken into account. Their energy
consumption is evenly distributed among all vCPUs threads running on
the same physical CPU package.

To overcome the problem that reading the RAPL MSR requires priviliged
access, a socket communication between QEMU and the qemu-vmsr-helper is
mandatory. You can specified the socket path in the parameter.

This feature is activated with -accel kvm,rapl=true,path=/path/sock.sock

Actual limitation:
- Works only on Intel host CPU because AMD CPUs are using different MSR
  adresses.

- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at
  the moment.

Signed-off-by: Anthony Harivel <aharivel@redhat.com>
---
 accel/kvm/kvm-all.c           |  27 +++
 docs/specs/index.rst          |   1 +
 docs/specs/rapl-msr.rst       | 133 +++++++++++++
 include/sysemu/kvm_int.h      |  17 ++
 target/i386/cpu.h             |   8 +
 target/i386/kvm/kvm.c         | 348 ++++++++++++++++++++++++++++++++++
 target/i386/kvm/meson.build   |   1 +
 target/i386/kvm/vmsr_energy.c | 295 ++++++++++++++++++++++++++++
 target/i386/kvm/vmsr_energy.h |  87 +++++++++
 9 files changed, 917 insertions(+)
 create mode 100644 docs/specs/rapl-msr.rst
 create mode 100644 target/i386/kvm/vmsr_energy.c
 create mode 100644 target/i386/kvm/vmsr_energy.h

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 49e755ec4ad2..d63a6af91291 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3603,6 +3603,21 @@ static void kvm_set_device(Object *obj,
     s->device = g_strdup(value);
 }
 
+static void kvm_set_kvm_rapl(Object *obj, bool value, Error **errp)
+{
+    KVMState *s = KVM_STATE(obj);
+    s->msr_energy.enable = value;
+}
+
+static void kvm_set_kvm_rapl_socket_path(Object *obj,
+                                         const char *str,
+                                         Error **errp)
+{
+    KVMState *s = KVM_STATE(obj);
+    g_free(s->msr_energy.socket_path);
+    s->msr_energy.socket_path = g_strdup(str);
+}
+
 static void kvm_accel_instance_init(Object *obj)
 {
     KVMState *s = KVM_STATE(obj);
@@ -3622,6 +3637,7 @@ static void kvm_accel_instance_init(Object *obj)
     s->xen_gnttab_max_frames = 64;
     s->xen_evtchn_max_pirq = 256;
     s->device = NULL;
+    s->msr_energy.enable = false;
 }
 
 /**
@@ -3666,6 +3682,17 @@ static void kvm_accel_class_init(ObjectClass *oc, void *data)
     object_class_property_set_description(oc, "device",
         "Path to the device node to use (default: /dev/kvm)");
 
+    object_class_property_add_bool(oc, "rapl",
+                                   NULL,
+                                   kvm_set_kvm_rapl);
+    object_class_property_set_description(oc, "rapl",
+        "Allow energy related MSRs for RAPL interface in Guest");
+
+    object_class_property_add_str(oc, "rapl-helper-socket", NULL,
+                                  kvm_set_kvm_rapl_socket_path);
+    object_class_property_set_description(oc, "rapl-helper-socket",
+        "Socket Path for comminucating with the Virtual MSR helper daemon");
+
     kvm_arch_accel_class_init(oc);
 }
 
diff --git a/docs/specs/index.rst b/docs/specs/index.rst
index b3f482b0aa58..b426ebb7713c 100644
--- a/docs/specs/index.rst
+++ b/docs/specs/index.rst
@@ -32,3 +32,4 @@ guest hardware that is specific to QEMU.
    virt-ctlr
    vmcoreinfo
    vmgenid
+   rapl-msr
diff --git a/docs/specs/rapl-msr.rst b/docs/specs/rapl-msr.rst
new file mode 100644
index 000000000000..04d27c198fc0
--- /dev/null
+++ b/docs/specs/rapl-msr.rst
@@ -0,0 +1,133 @@
+================
+RAPL MSR support
+================
+
+The RAPL interface (Running Average Power Limit) is advertising the accumulated
+energy consumption of various power domains (e.g. CPU packages, DRAM, etc.).
+
+The consumption is reported via MSRs (model specific registers) like
+MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are 64 bits
+registers that represent the accumulated energy consumption in micro Joules.
+
+Thanks to the MSR Filtering patch [#a]_ not all MSRs are handled by KVM. Some
+of them can now be handled by the userspace (QEMU). It uses a mechanism called
+"MSR filtering" where a list of MSRs is given at init time of a VM to KVM so
+that a callback is put in place. The design of this patch uses only this
+mechanism for handling the MSRs between guest/host.
+
+At the moment the following MSRs are involved:
+
+.. code:: C
+
+    #define MSR_RAPL_POWER_UNIT             0x00000606
+    #define MSR_PKG_POWER_LIMIT             0x00000610
+    #define MSR_PKG_ENERGY_STATUS           0x00000611
+    #define MSR_PKG_POWER_INFO              0x00000614
+
+The ``*_POWER_UNIT``, ``*_POWER_LIMIT``, ``*_POWER INFO`` are part of the RAPL
+spec and specify the power limit of the package, provide range of parameter(min
+power, max power,..) and also the information of the multiplier for the energy
+counter to calculate the power. Those MSRs are populated once at the beginning
+by reading the host CPU MSRs and are given back to the guest 1:1 when
+requested.
+
+The MSR_PKG_ENERGY_STATUS is a counter; it represents the total amount of
+energy consumed since the last time the register was cleared. If you multiply
+it with the UNIT provided above you'll get the power in micro-joules. This
+counter is always increasing and it increases more or less faster depending on
+the consumption of the package. This counter is supposed to overflow at some
+point.
+
+Each core belonging to the same Package reading the MSR_PKG_ENERGY_STATUS (i.e
+"rdmsr 0x611") will retrieve the same value. The value represents the energy
+for the whole package. Whatever Core reading it will get the same value and a
+core that belongs to PKG-0 will not be able to get the value of PKG-1 and
+vice-versa.
+
+High level implementation
+-------------------------
+
+In order to update the value of the virtual MSR, a QEMU thread is created.
+The thread is basically just an infinity loop that does:
+
+1. Snapshot of the time metrics of all QEMU threads (Time spent scheduled in
+   Userspace and System)
+
+2. Snapshot of the actual MSR_PKG_ENERGY_STATUS counter of all packages where
+   the QEMU threads are running on.
+
+3. Sleep for 1 second - During this pause the vcpu and other non-vcpu threads
+   will do what they have to do and so the energy counter will increase.
+
+4. Repeat 2. and 3. and calculate the delta of every metrics representing the
+   time spent scheduled for each QEMU thread *and* the energy spent by the
+   packages during the pause.
+
+5. Filter the vcpu threads and the non-vcpu threads.
+
+6. Retrieve the topology of the Virtual Machine. This helps identify which
+   vCPU is running on which virtual package.
+
+7. The total energy spent by the non-vcpu threads is divided by the number
+   of vcpu threads so that each vcpu thread will get an equal part of the
+   energy spent by the QEMU workers.
+
+8. Calculate the ratio of energy spent per vcpu threads.
+
+9. Calculate the energy for each virtual package.
+
+10. The virtual MSRs are updated for each virtual package. Each vCPU that
+    belongs to the same package will return the same value when accessing the
+    the MSR.
+
+11. Loop back to 1.
+
+Ratio calculation
+-----------------
+
+In Linux, a process has an execution time associated with it. The scheduler is
+dividing the time in clock ticks. The number of clock ticks per second can be
+found by the sysconf system call. A typical value of clock ticks per second is
+100. So a core can run a process at the maximum of 100 ticks per second. If a
+package has 4 cores, 400 ticks maximum can be scheduled on all the cores
+of the package for a period of 1 second.
+
+The /proc/[pid]/stat [#b]_ is a sysfs file that can give the executed time of a
+process with the [pid] as the process ID. It gives the amount of ticks the
+process has been scheduled in userspace (utime) and kernel space (stime).
+
+By reading those metrics for a thread, one can calculate the ratio of time the
+package has spent executing the thread.
+
+Example:
+
+A 4 cores package can schedule a maximum of 400 ticks per second with 100 ticks
+per second per core. If a thread was scheduled for 100 ticks between a second
+on this package, that means my thread has been scheduled for 1/4 of the whole
+package. With that, the calculation of the energy spent by the thread on this
+package during this whole second is 1/4 of the total energy spent by the
+package.
+
+Usage
+-----
+
+This feature is activated with -accel
+kvm,rapl=true,rapl-helper-socket=/path/sock.sock
+
+It is important that the socket path is the same as the one
+:program:`qemu-vmsr-helper` is listneing to.
+
+Current Limitations
+-------------------
+
+- Works only on Intel host CPUs because AMD CPUs are using different MSR
+  addresses.
+
+- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at the
+  moment.
+
+References
+----------
+
+.. [#a] https://patchwork.kernel.org/project/kvm/patch/20200916202951.23760-7-graf@amazon.com/
+.. [#b] https://man7.org/linux/man-pages/man5/proc.5.html
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 882e37e12c5b..ee2fe8817833 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -14,6 +14,8 @@
 #include "qemu/accel.h"
 #include "qemu/queue.h"
 #include "sysemu/kvm.h"
+#include "hw/boards.h"
+#include "hw/i386/topology.h"
 
 typedef struct KVMSlot
 {
@@ -48,6 +50,20 @@ typedef struct KVMMemoryListener {
 
 #define KVM_MSI_HASHTAB_SIZE    256
 
+struct KVMMsrEnergy {
+    bool enable;
+    char *socket_path;
+    QemuThread msr_thr;
+    unsigned int cpus;
+    unsigned int sockets;
+    X86CPUTopoInfo topo_info;
+    const CPUArchIdList *cpu_list;
+    uint64_t *msr_value;
+    uint64_t msr_unit;
+    uint64_t msr_limit;
+    uint64_t msr_info;
+};
+
 enum KVMDirtyRingReaperState {
     KVM_DIRTY_RING_REAPER_NONE = 0,
     /* The reaper is sleeping */
@@ -114,6 +130,7 @@ struct KVMState
     bool kvm_dirty_ring_with_bitmap;
     uint64_t kvm_eager_split_size;  /* Eager Page Splitting chunk size */
     struct KVMDirtyRingReaper reaper;
+    struct KVMMsrEnergy msr_energy;
     NotifyVmexitOption notify_vmexit;
     uint32_t notify_window;
     uint32_t xen_version;
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 7f0786e8b98f..8d861296b1af 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -396,6 +396,10 @@ typedef enum X86Seg {
 #define MSR_IA32_TSX_CTRL		0x122
 #define MSR_IA32_TSCDEADLINE            0x6e0
 #define MSR_IA32_PKRS                   0x6e1
+#define MSR_RAPL_POWER_UNIT             0x00000606
+#define MSR_PKG_POWER_LIMIT             0x00000610
+#define MSR_PKG_ENERGY_STATUS           0x00000611
+#define MSR_PKG_POWER_INFO              0x00000614
 #define MSR_ARCH_LBR_CTL                0x000014ce
 #define MSR_ARCH_LBR_DEPTH              0x000014cf
 #define MSR_ARCH_LBR_FROM_0             0x00001500
@@ -1788,6 +1792,10 @@ typedef struct CPUArchState {
 
     uintptr_t retaddr;
 
+    /* RAPL MSR */
+    uint64_t msr_rapl_power_unit;
+    uint64_t msr_pkg_energy_status;
+
     /* Fields up to this point are cleared by a CPU reset */
     struct {} end_reset_fields;
 
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 76a66246eb72..e6cb315c0a90 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -16,16 +16,22 @@
 #include "qapi/qapi-events-run-state.h"
 #include "qapi/error.h"
 #include "qapi/visitor.h"
+#include <math.h>
+#include <stdint.h>
 #include <sys/ioctl.h>
 #include <sys/utsname.h>
 #include <sys/syscall.h>
+#include <sys/resource.h>
+#include <sys/time.h>
 
 #include <linux/kvm.h>
+#include <unistd.h>
 #include "standard-headers/asm-x86/kvm_para.h"
 #include "hw/xen/interface/arch-x86/cpuid.h"
 
 #include "cpu.h"
 #include "host-cpu.h"
+#include "vmsr_energy.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/hw_accel.h"
 #include "sysemu/kvm_int.h"
@@ -2475,6 +2481,49 @@ static bool kvm_rdmsr_core_thread_count(X86CPU *cpu, uint32_t msr,
     return true;
 }
 
+static bool kvm_rdmsr_rapl_power_unit(X86CPU *cpu, uint32_t msr,
+                                        uint64_t *val)
+{
+
+    CPUState *cs = CPU(cpu);
+
+    *val = cs->kvm_state->msr_energy.msr_unit;
+
+    return true;
+}
+
+static bool kvm_rdmsr_pkg_power_limit(X86CPU *cpu, uint32_t msr,
+                                        uint64_t *val)
+{
+
+    CPUState *cs = CPU(cpu);
+
+    *val = cs->kvm_state->msr_energy.msr_limit;
+
+    return true;
+}
+
+static bool kvm_rdmsr_pkg_power_info(X86CPU *cpu, uint32_t msr,
+                                        uint64_t *val)
+{
+
+    CPUState *cs = CPU(cpu);
+
+    *val = cs->kvm_state->msr_energy.msr_info;
+
+    return true;
+}
+
+static bool kvm_rdmsr_pkg_energy_status(X86CPU *cpu, uint32_t msr,
+    uint64_t *val)
+{
+
+    CPUState *cs = CPU(cpu);
+    *val = cs->kvm_state->msr_energy.msr_value[cs->cpu_index];
+
+    return true;
+}
+
 static Notifier smram_machine_done;
 static KVMMemoryListener smram_listener;
 static AddressSpace smram_address_space;
@@ -2509,6 +2558,265 @@ static void register_smram_listener(Notifier *n, void *unused)
                                  &smram_address_space, 1, "kvm-smram");
 }
 
+static void *kvm_msr_energy_thread(void *data)
+{
+    KVMState *s = data;
+    struct KVMMsrEnergy *vmsr = &s->msr_energy;
+
+    g_autofree package_energy_stat *pkg_stat = NULL;
+    g_autofree thread_stat *thd_stat = NULL;
+    g_autofree pid_t *thread_ids = NULL;
+    g_autofree CPUState *cpu = NULL;
+    unsigned int maxpkgs, maxcpus, maxticks;
+    g_autofree unsigned int *vpkgs_energy_stat = NULL;
+    unsigned int num_threads = 0;
+    unsigned int tmp_num_threads = 0;
+    pid_t pid;
+
+    X86CPUTopoIDs topo_ids;
+
+
+    rcu_register_thread();
+
+    /* Get QEMU PID*/
+    pid = getpid();
+
+    /* Nb of CPUS per packages */
+    maxcpus = vmsr_get_maxcpus(0);
+
+    /* Nb of Physical Packages on the system */
+    maxpkgs = vmsr_get_max_physical_package(maxcpus);
+
+    /* Those MSR values should not change as well */
+    vmsr->msr_unit  = vmsr_read_msr(MSR_RAPL_POWER_UNIT, 0, pid,
+                                    s->msr_energy.socket_path);
+    vmsr->msr_limit = vmsr_read_msr(MSR_PKG_POWER_LIMIT, 0, pid,
+                                    s->msr_energy.socket_path);
+    vmsr->msr_info  = vmsr_read_msr(MSR_PKG_POWER_INFO, 0, pid,
+                                    s->msr_energy.socket_path);
+
+    /* Allocate memory for each package energy status */
+    pkg_stat = (package_energy_stat *)
+        g_new0(package_energy_stat, maxpkgs);
+
+    /* Pre-allocate memory for thread stats */
+    thd_stat = g_new0(thread_stat, 1);
+
+    /* Pre-allocate memory for holding Virtual Package Energy counter */
+    vpkgs_energy_stat = g_new0(unsigned int, vmsr->sockets);
+
+    /*
+     * Max numbers of ticks per package
+     * time in second * number of ticks/second * Number of cores / package
+     * ex: for 100 ticks/second/CPU, 12 CPUs per Package gives 1200 ticks max
+     */
+    maxticks = (MSR_ENERGY_THREAD_SLEEP_US / 1000000)
+                    * sysconf(_SC_CLK_TCK) * maxcpus;
+
+    while (true) {
+        /* Get all qemu threads id */
+        thread_ids = vmsr_get_thread_ids(pid, &num_threads);
+
+        if (thread_ids == NULL) {
+            goto clean;
+        }
+
+        if (tmp_num_threads < num_threads) {
+            thd_stat = g_renew(thread_stat, thd_stat, num_threads);
+        }
+
+        tmp_num_threads = num_threads;
+
+        /* Populate all the thread stats */
+        for (int i = 0; i < num_threads; i++) {
+            thd_stat[i].utime = g_new0(unsigned long long, 2);
+            thd_stat[i].stime = g_new0(unsigned long long, 2);
+            thd_stat[i].thread_id = thread_ids[i];
+            vmsr_read_thread_stat(&thd_stat[i], pid, 0);
+            thd_stat[i].numa_node_id = numa_node_of_cpu(thd_stat[i].cpu_id);
+        }
+
+        /* Retrieve all packages power plane energy counter */
+        for (int i = 0; i <= maxpkgs; i++) {
+            for (int j = 0; j < num_threads; j++) {
+                /*
+                 * Use the first thread we found that ran on the CPU
+                 * of the package to read the packages energy counter
+                 */
+                if (thd_stat[j].numa_node_id == i) {
+                    pkg_stat[i].e_start =
+                    vmsr_read_msr(MSR_PKG_ENERGY_STATUS, i, pid,
+                                  s->msr_energy.socket_path);
+                    break;
+                }
+            }
+        }
+
+        /* Sleep a short period while the other threads are working */
+        usleep(MSR_ENERGY_THREAD_SLEEP_US);
+
+        /*
+         * Retrieve all packages power plane energy counter
+         * Calculate the delta of all packages
+         */
+        for (int i = 0; i <= maxpkgs; i++) {
+            for (int j = 0; j < num_threads; j++) {
+                /*
+                 * Use the first thread we found that ran on the CPU
+                 * of the package to read the packages energy counter
+                 */
+                if (thd_stat[j].numa_node_id == i) {
+                    pkg_stat[i].e_end =
+                       vmsr_read_msr(MSR_PKG_ENERGY_STATUS,
+                                     thd_stat[j].cpu_id,
+                                     thd_stat[j].thread_id,
+                                     s->msr_energy.socket_path);
+                    /*
+                     * Prevent the case we have migrate the VM
+                     * during the sleep period or any other cases
+                     * were energy counter might be lower after
+                     * the sleep.
+                     */
+                    if (pkg_stat[i].e_end > pkg_stat[i].e_start) {
+                        pkg_stat[i].e_delta =
+                            pkg_stat[i].e_end - pkg_stat[i].e_start;
+                    } else {
+                        pkg_stat[i].e_delta = 0;
+                    }
+                    break;
+                }
+            }
+        }
+
+        /* Delta of ticks spend by each thread between the sample */
+        for (int i = 0; i < num_threads; i++) {
+            if (vmsr_read_thread_stat(&thd_stat[i], pid, 1) != 0) {
+                /*
+                 * We don't count the dead thread
+                 * i.e threads that existed before the sleep
+                 * and not anymore
+                 */
+                thd_stat[i].delta_ticks = 0;
+            } else {
+                vmsr_delta_ticks(thd_stat, i);
+            }
+        }
+
+        /*
+         * Identify the vCPU threads
+         * Calculate the Number of vCPU per package
+         */
+        CPU_FOREACH(cpu) {
+            for (int i = 0; i < num_threads; i++) {
+                if (cpu->thread_id == thd_stat[i].thread_id) {
+                    thd_stat[i].is_vcpu = true;
+                    thd_stat[i].vcpu_id = cpu->cpu_index;
+                    pkg_stat[thd_stat[i].numa_node_id].nb_vcpu++;
+                    thd_stat[i].acpi_id = kvm_arch_vcpu_id(cpu);
+                    break;
+                }
+            }
+        }
+
+        /* Retrieve the virtual package number of each vCPU */
+        for (int i = 0; i < vmsr->x86_cpu_list->len; i++) {
+            for (int j = 0; j < num_threads; j++) {
+                if ((thd_stat[j].acpi_id == vmsr->x86_cpu_list->cpus[i].arch_id)
+                    && (thd_stat[j].is_vcpu == true)) {
+                    x86_topo_ids_from_apicid(thd_stat[j].acpi_id,
+                        &vmsr->topo_info, &topo_ids);
+                    thd_stat[j].vpkg = topo_ids.pkg_id;
+                }
+            }
+        }
+
+        /* Calculate the total energy of all non-vCPU thread */
+        for (int i = 0; i < num_threads; i++) {
+            double temp;
+            if ((thd_stat[i].is_vcpu != true) &&
+                (thd_stat[i].delta_ticks > 0)) {
+                temp = vmsr_get_ratio(pkg_stat, thd_stat, maxticks, i);
+                pkg_stat[thd_stat[i].numa_node_id].e_ratio
+                    += (uint64_t)lround(temp);
+            }
+        }
+
+        /* Calculate the ratio per non-vCPU thread of each package */
+        for (int i = 0; i <= maxpkgs; i++) {
+            if (pkg_stat[i].nb_vcpu > 0) {
+                pkg_stat[i].e_ratio = pkg_stat[i].e_ratio / pkg_stat[i].nb_vcpu;
+            }
+        }
+
+        /*
+         * Calculate the energy for each Package:
+         * Energy Package = sum of each vCPU energy that belongs to the package
+         */
+        for (int i = 0; i < num_threads; i++) {
+            double temp;
+
+            if ((thd_stat[i].is_vcpu == true) && \
+                    (thd_stat[i].delta_ticks > 0)) {
+                temp = vmsr_get_ratio(pkg_stat, thd_stat, maxticks, i);
+
+                vpkgs_energy_stat[thd_stat[i].vpkg] += (uint64_t)lround(temp);
+                vpkgs_energy_stat[thd_stat[i].vpkg] +=
+                    pkg_stat[thd_stat[i].numa_node_id].e_ratio;
+            }
+        }
+
+        /*
+         * Finally populate the vmsr register of each vCPU with the total
+         * package value to emulate the real hardware where each CPU return the
+         * value of the package it belongs.
+         */
+        for (int i = 0; i < num_threads; i++) {
+            if ((thd_stat[i].is_vcpu == true) && \
+                    (thd_stat[i].delta_ticks > 0)) {
+                vmsr->msr_value[thd_stat[i].vcpu_id] = \
+                                            vpkgs_energy_stat[thd_stat[i].vpkg];
+          }
+        }
+
+        /* Zero out the memory */
+        for (int i = 0; i < num_threads; i++) {
+            memset(thd_stat[i].utime, 0, 2 * sizeof(unsigned long long));
+            memset(thd_stat[i].stime, 0, 2 * sizeof(unsigned long long));
+        }
+        memset(thd_stat, 0, num_threads * sizeof(thread_stat));
+        memset(thread_ids, 0, sizeof(pid_t));
+    }
+
+clean:
+    rcu_unregister_thread();
+    return NULL;
+}
+
+static int kvm_msr_energy_thread_init(KVMState *s, MachineState *ms)
+{
+    struct KVMMsrEnergy *r = &s->msr_energy;
+
+    vmsr_init_topo_info(&r->topo_info, ms);
+
+    /* Retrieve the number of vCPU */
+    r->cpus = ms->smp.cpus;
+
+    /* Retrieve the number of sockets */
+    r->sockets = ms->smp.sockets;
+
+    /* Allocate register memory (MSR_PKG_STATUS) for each vCPU */
+    r->msr_value = g_new0(uint64_t, r->cpus);
+
+    /* Retrieve the CPUArchIDlist */
+    r->x86_cpu_list = x86_possible_cpu_arch_ids(ms);
+
+    qemu_thread_create(&r->msr_thr, "kvm-msr",
+                       kvm_msr_energy_thread,
+                       s, QEMU_THREAD_JOINABLE);
+
+    return 0;
+}
+
 int kvm_arch_get_default_type(MachineState *ms)
 {
     return 0;
@@ -2711,6 +3019,46 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
                          strerror(-ret));
             exit(1);
         }
+
+        if (s->msr_energy.enable == true) {
+
+            r = kvm_filter_msr(s, MSR_RAPL_POWER_UNIT,
+                               kvm_rdmsr_rapl_power_unit, NULL);
+            if (!r) {
+                error_report("Could not install MSR_RAPL_POWER_UNIT \
+                                handler: %s",
+                             strerror(-ret));
+                exit(1);
+            }
+
+            r = kvm_filter_msr(s, MSR_PKG_POWER_LIMIT,
+                               kvm_rdmsr_pkg_power_limit, NULL);
+            if (!r) {
+                error_report("Could not install MSR_PKG_POWER_LIMIT \
+                                handler: %s",
+                             strerror(-ret));
+                exit(1);
+            }
+
+            r = kvm_filter_msr(s, MSR_PKG_POWER_INFO,
+                               kvm_rdmsr_pkg_power_info, NULL);
+            if (!r) {
+                error_report("Could not install MSR_PKG_POWER_INFO \
+                                handler: %s",
+                             strerror(-ret));
+                exit(1);
+            }
+            r = kvm_filter_msr(s, MSR_PKG_ENERGY_STATUS,
+                               kvm_rdmsr_pkg_energy_status, NULL);
+            if (!r) {
+                error_report("Could not install MSR_PKG_ENERGY_STATUS \
+                                handler: %s",
+                             strerror(-ret));
+                exit(1);
+            } else {
+                kvm_msr_energy_thread_init(s, ms);
+            }
+        }
     }
 
     return 0;
diff --git a/target/i386/kvm/meson.build b/target/i386/kvm/meson.build
index 84d9143e6029..16010638df69 100644
--- a/target/i386/kvm/meson.build
+++ b/target/i386/kvm/meson.build
@@ -3,6 +3,7 @@ i386_kvm_ss = ss.source_set()
 i386_kvm_ss.add(files(
   'kvm.c',
   'kvm-cpu.c',
+  'vmsr_energy.c',
 ))
 
 i386_kvm_ss.add(when: 'CONFIG_XEN_EMU', if_true: files('xen-emu.c'))
diff --git a/target/i386/kvm/vmsr_energy.c b/target/i386/kvm/vmsr_energy.c
new file mode 100644
index 000000000000..27588630efa4
--- /dev/null
+++ b/target/i386/kvm/vmsr_energy.c
@@ -0,0 +1,295 @@
+/*
+ * QEMU KVM support -- x86 virtual energy-related MSR.
+ *
+ * Copyright 2023 Red Hat, Inc. 2023
+ *
+ *  Author:
+ *      Anthony Harivel <aharivel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "vmsr_energy.h"
+#include "qapi/error.h"
+#include "io/channel.h"
+#include "io/channel-socket.h"
+#include "hw/boards.h"
+
+#define MAX_PATH_LEN 256
+#define MAX_LINE_LEN 500
+
+static char *compute_default_paths(void)
+{
+    g_autofree char *state = qemu_get_local_state_dir();
+
+    return g_build_filename(state, "run", "qemu-vmsr-helper.sock", NULL);
+}
+
+static int vmsr_helper_socket_read(QIOChannel *ioc,
+                                  void *buf, int sz, Error **errp)
+{
+    ssize_t r = qio_channel_read_all(ioc, buf, sz, errp);
+
+    if (r < 0) {
+        object_unref(OBJECT(ioc));
+        ioc = NULL;
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+static int vmsr_helper_socket_write(QIOChannel *ioc,
+                                   int fd,
+                                   const void *buf, int sz, Error **errp)
+{
+    size_t nfds = (fd != -1);
+    while (sz > 0) {
+        struct iovec iov;
+        ssize_t n_written;
+
+        iov.iov_base = (void *)buf;
+        iov.iov_len = sz;
+        n_written = qio_channel_writev_full(QIO_CHANNEL(ioc), &iov, 1,
+                                            nfds ? &fd : NULL, nfds, 0, errp);
+
+        if (n_written <= 0) {
+            assert(n_written != QIO_CHANNEL_ERR_BLOCK);
+            object_unref(OBJECT(ioc));
+            ioc = NULL;
+            return n_written < 0 ? -EINVAL : 0;
+        }
+
+        nfds = 0;
+        buf += n_written;
+        sz -= n_written;
+    }
+
+    return 0;
+}
+
+uint64_t vmsr_read_msr(uint32_t reg, unsigned int cpu_id, uint32_t tid,
+                       const char *path)
+{
+    uint64_t data = 0;
+    char *socket_path = NULL;
+    unsigned int buffer[3];
+
+    if (path == NULL) {
+        socket_path = compute_default_paths();
+    } else {
+        socket_path = g_strdup(path);
+    }
+
+    SocketAddress saddr = {
+        .type = SOCKET_ADDRESS_TYPE_UNIX,
+        .u.q_unix.path = socket_path
+    };
+    QIOChannelSocket *sioc = qio_channel_socket_new();
+    Error *local_err = NULL;
+
+    int r;
+
+    qio_channel_set_name(QIO_CHANNEL(sioc), "vmsr-helper");
+    qio_channel_socket_connect_sync(sioc,
+                                    &saddr,
+                                    &local_err);
+    g_free(socket_path);
+    if (local_err) {
+        goto out_close;
+    }
+
+    /*
+     * Send the required arguments:
+     * 1. RAPL MSR register to read
+     * 2. On which CPU ID
+     * 3. From which vCPU (Thread ID)
+     */
+    buffer[0] = reg;
+    buffer[1] = cpu_id;
+    buffer[2] = tid;
+
+    r = vmsr_helper_socket_write(QIO_CHANNEL(sioc),
+                                 -1,
+                                 &buffer, sizeof(buffer),
+                                 &local_err);
+    if (r < 0) {
+        goto out_close;
+    }
+
+    r = vmsr_helper_socket_read(QIO_CHANNEL(sioc),
+                                &data, sizeof(data),
+                                &local_err);
+    if (r < 0) {
+        data = 0;
+        goto out_close;
+    }
+
+out_close:
+    /* Close socket. */
+    qio_channel_close(QIO_CHANNEL(sioc), NULL);
+    object_unref(OBJECT(sioc));
+    return data;
+}
+
+/* Retrieve the max number of physical CPU on the package */
+unsigned int vmsr_get_maxcpus(unsigned int package_num)
+{
+    int k, ncpus;
+    unsigned int maxcpus;
+    struct bitmask *cpus;
+
+    cpus = numa_allocate_cpumask();
+    ncpus = cpus->size;
+
+    if (numa_node_to_cpus(package_num, cpus) < 0) {
+        return 0;
+    }
+
+    maxcpus = 0;
+    for (k = 0; k < ncpus; k++) {
+        if (numa_bitmask_isbitset(cpus, k)) {
+            maxcpus++;
+        }
+    }
+
+    return maxcpus;
+}
+
+/* Retrieve the maximum number of physical packages */
+unsigned int vmsr_get_max_physical_package(unsigned int max_cpus)
+{
+    unsigned int packageCount = 0;
+    const char *dir = "/sys/devices/system/cpu/";
+    int *uniquePackages;
+
+    char *filePath;
+    FILE *file;
+
+    uniquePackages = g_new0(int, max_cpus);
+
+    for (int i = 0; i < max_cpus; i++) {
+        filePath = g_build_filename(dir, g_strdup_printf("cpu%d", i),
+                                     "topology/physical_package_id", NULL);
+
+        file = fopen(filePath, "r");
+
+        if (file == NULL) {
+            perror("Error opening file");
+            g_free(filePath);
+            g_free(uniquePackages);
+            return 0;
+        }
+
+        char packageId[10];
+        if (fgets(packageId, sizeof(packageId), file) == NULL) {
+            packageCount = 0;
+        }
+
+        fclose(file);
+
+        int currentPackageId = atoi(packageId);
+
+        bool isUnique = true;
+        for (int j = 0; j < packageCount; j++) {
+            if (uniquePackages[j] == currentPackageId) {
+                isUnique = false;
+                break;
+            }
+        }
+
+        if (isUnique) {
+            uniquePackages[packageCount] = currentPackageId;
+            packageCount++;
+
+            if (packageCount >= max_cpus) {
+                break;
+            }
+        }
+    }
+
+    g_free(filePath);
+    g_free(uniquePackages);
+    return (packageCount == 0) ? 1 : packageCount;
+}
+int vmsr_read_thread_stat(struct thread_stat *thread, int pid, int index)
+{
+    char *path;
+    path = g_new0(char, MAX_PATH_LEN);
+
+    path = g_build_filename(g_strdup_printf("/proc/%u/task/%d/stat", pid, \
+            thread->thread_id), NULL);
+
+    FILE *file = fopen(path, "r");
+    if (file == NULL) {
+        return -1;
+    }
+
+    if (fscanf(file, "%*d (%*[^)]) %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u"
+        " %llu %llu %*d %*d %*d %*d %*d %*d %*u %*u %*d %*u %*u"
+        " %*u %*u %*u %*u %*u %*u %*u %*u %*u %*d %*u %*u %u",
+           &thread->utime[index], &thread->stime[index], &thread->cpu_id) != 3)
+        return -1;
+
+    fclose(file);
+    return 0;
+}
+
+/* Read QEMU stat task folder to retrieve all QEMU threads ID */
+pid_t *vmsr_get_thread_ids(pid_t pid, unsigned int *num_threads)
+{
+    char *path = g_build_filename("/proc", g_strdup_printf("%d/task", pid), NULL);
+
+    DIR *dir = opendir(path);
+    if (dir == NULL) {
+        perror("opendir");
+        g_free(path);
+        return NULL;
+    }
+
+    pid_t *thread_ids = NULL;
+    unsigned int thread_count = 0;
+
+    struct dirent *ent;
+    while ((ent = readdir(dir)) != NULL) {
+        if (ent->d_name[0] == '.') {
+            continue;
+        }
+        pid_t tid = atoi(ent->d_name);
+        if (pid != tid) {
+            thread_ids = g_renew(pid_t, thread_ids, (thread_count + 1));
+            thread_ids[thread_count] = tid;
+            thread_count++;
+        }
+    }
+
+    closedir(dir);
+
+    *num_threads = thread_count;
+    g_free(path);
+    return thread_ids;
+}
+
+void vmsr_delta_ticks(thread_stat *thd_stat, int i)
+{
+    thd_stat[i].delta_ticks = (thd_stat[i].utime[1] + thd_stat[i].stime[1])
+                            - (thd_stat[i].utime[0] + thd_stat[i].stime[0]);
+}
+
+double vmsr_get_ratio(package_energy_stat *pkg_stat,
+                        thread_stat *thd_stat,
+                        int maxticks, int i) {
+
+    return (pkg_stat[thd_stat[i].numa_node_id].e_delta / 100.0)
+            * ((100.0 / maxticks) * thd_stat[i].delta_ticks);
+}
+
+void vmsr_init_topo_info(X86CPUTopoInfo *topo_info,
+                           const MachineState *ms)
+{
+    topo_info->dies_per_pkg = ms->smp.dies;
+    topo_info->cores_per_die = ms->smp.cores;
+    topo_info->threads_per_core = ms->smp.threads;
+}
diff --git a/target/i386/kvm/vmsr_energy.h b/target/i386/kvm/vmsr_energy.h
new file mode 100644
index 000000000000..385d53b9c16c
--- /dev/null
+++ b/target/i386/kvm/vmsr_energy.h
@@ -0,0 +1,87 @@
+/*
+ * QEMU KVM support -- x86 virtual energy-related MSR.
+ *
+ * Copyright 2023 Red Hat, Inc. 2023
+ *
+ *  Author:
+ *      Anthony Harivel <aharivel@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef VMSR_ENERGY_H
+#define VMSR_ENERGY_H
+
+#include "qemu/osdep.h"
+#include <numa.h>
+#include <stdint.h>
+#include "hw/i386/topology.h"
+
+/*
+ * Define the interval time in micro seconds between 2 samples of
+ * energy related MSRs
+ */
+#define MSR_ENERGY_THREAD_SLEEP_US 1000000.0
+
+/*
+ * Thread statistic
+ * @ thread_id: TID (thread ID)
+ * @ is_vcpu: true is thread is vCPU thread
+ * @ cpu_id: CPU number last executed on
+ * @ vcpu_id: vCPU ID
+ * @ numa_node_id:node number of the CPU
+ * @ vpkg: virtual package number
+ * @ acpi_id: APIC id of the vCPU
+ * @ utime: amount of clock ticks the thread
+ *          has been scheduled in User mode
+ * @ stime: amount of clock ticks the thread
+ *          has been scheduled in System mode
+ * @ delta_ticks: delta of utime+stime between
+ *          the two samples (before/after sleep)
+ */
+struct thread_stat {
+    unsigned int thread_id;
+    bool is_vcpu;
+    unsigned int cpu_id;
+    unsigned int vcpu_id;
+    unsigned int numa_node_id;
+    unsigned int vpkg;
+    unsigned long acpi_id;
+    unsigned long long *utime;
+    unsigned long long *stime;
+    unsigned long long delta_ticks;
+};
+
+/*
+ * Package statistic
+ * @ e_start: package energy counter before the sleep
+ * @ e_end: package energy counter after the sleep
+ * @ e_delta: delta of package energy counter
+ * @ e_ratio: store the energy ratio of non-vCPU thread
+ * @ nb_vcpu: number of vCPU running on this package
+ */
+struct package_energy_stat {
+    uint64_t e_start;
+    uint64_t e_end;
+    uint64_t e_delta;
+    uint64_t e_ratio;
+    unsigned int nb_vcpu;
+};
+
+typedef struct thread_stat thread_stat;
+typedef struct package_energy_stat package_energy_stat;
+
+uint64_t vmsr_read_msr(uint32_t reg, unsigned int cpu_id,
+                       unsigned int tid, const char *path);
+void vmsr_delta_ticks(thread_stat *thd_stat, int i);
+unsigned int vmsr_get_maxcpus(unsigned int package_num);
+unsigned int vmsr_get_max_physical_package(unsigned int max_cpus);
+int vmsr_read_thread_stat(struct thread_stat *thread, int pid, int index);
+pid_t *vmsr_get_thread_ids(pid_t pid, unsigned int *num_threads);
+double vmsr_get_ratio(package_energy_stat *pkg_stat,
+                        thread_stat *thd_stat,
+                        int maxticks, int i);
+void vmsr_init_topo_info(X86CPUTopoInfo *topo_info, const MachineState *ms);
+#endif /* VMSR_ENERGY_H */
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu
  2024-01-25  7:22 ` [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu Anthony Harivel
@ 2024-01-29 19:29   ` Daniel P. Berrangé
  2024-02-20 14:00     ` Anthony Harivel
  2024-03-05 14:58     ` Anthony Harivel
  2024-01-30  9:13   ` Daniel P. Berrangé
  2024-01-30  9:39   ` Daniel P. Berrangé
  2 siblings, 2 replies; 32+ messages in thread
From: Daniel P. Berrangé @ 2024-01-29 19:29 UTC (permalink / raw)
  To: Anthony Harivel; +Cc: pbonzini, mtosatti, qemu-devel, vchundur

On Thu, Jan 25, 2024 at 08:22:14AM +0100, Anthony Harivel wrote:
> diff --git a/docs/specs/rapl-msr.rst b/docs/specs/rapl-msr.rst
> new file mode 100644
> index 000000000000..04d27c198fc0
> --- /dev/null
> +++ b/docs/specs/rapl-msr.rst
> @@ -0,0 +1,133 @@
> +================
> +RAPL MSR support
> +================

> +
> +Current Limitations
> +-------------------
> +
> +- Works only on Intel host CPUs because AMD CPUs are using different MSR
> +  addresses.

The privileged helper program is validating an allow list of MSRs.

If those MSRs are only correct on Intel hosts, then the validation
is incomplete, and it could be allowing unprivileged processes on
AMD hosts to access forbidden MSRS whose address happen to clash
with the Intel RAPL MSRs.

IOW, the privileged helper needs to call cpuid() and validate that
the current host vendor is Intel.

I suspect we also need a feature check of some kind to validate
that the intel processor supports this features, since old ones
definitely didn't, and we shouldn't assume all future ones will
either.

> diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
> index 882e37e12c5b..ee2fe8817833 100644
> --- a/include/sysemu/kvm_int.h
> +++ b/include/sysemu/kvm_int.h
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index 76a66246eb72..e6cb315c0a90 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -16,16 +16,22 @@
>  #include "qapi/qapi-events-run-state.h"
>  #include "qapi/error.h"
>  #include "qapi/visitor.h"
> +#include <math.h>
> +#include <stdint.h>

stdint.h always gets included via osdep.h	

>  #include <sys/ioctl.h>
>  #include <sys/utsname.h>
>  #include <sys/syscall.h>
> +#include <sys/resource.h>
> +#include <sys/time.h>
>  
>  #include <linux/kvm.h>
> +#include <unistd.h>

unistd.h also always gets included via osdep.h

>  #include "standard-headers/asm-x86/kvm_para.h"
>  #include "hw/xen/interface/arch-x86/cpuid.h"
>  
>  #include "cpu.h"
>  #include "host-cpu.h"
> +#include "vmsr_energy.h"
>  #include "sysemu/sysemu.h"
>  #include "sysemu/hw_accel.h"
>  #include "sysemu/kvm_int.h"

> @@ -2509,6 +2558,265 @@ static void register_smram_listener(Notifier *n, void *unused)
>                                   &smram_address_space, 1, "kvm-smram");
>  }
>  
> +static void *kvm_msr_energy_thread(void *data)
> +{
> +    KVMState *s = data;
> +    struct KVMMsrEnergy *vmsr = &s->msr_energy;
> +
> +    g_autofree package_energy_stat *pkg_stat = NULL;
> +    g_autofree thread_stat *thd_stat = NULL;
> +    g_autofree pid_t *thread_ids = NULL;
> +    g_autofree CPUState *cpu = NULL;
> +    unsigned int maxpkgs, maxcpus, maxticks;
> +    g_autofree unsigned int *vpkgs_energy_stat = NULL;
> +    unsigned int num_threads = 0;
> +    unsigned int tmp_num_threads = 0;
> +    pid_t pid;
> +
> +    X86CPUTopoIDs topo_ids;
> +
> +
> +    rcu_register_thread();
> +
> +    /* Get QEMU PID*/
> +    pid = getpid();
> +
> +    /* Nb of CPUS per packages */
> +    maxcpus = vmsr_get_maxcpus(0);
> +
> +    /* Nb of Physical Packages on the system */
> +    maxpkgs = vmsr_get_max_physical_package(maxcpus);
> +
> +    /* Those MSR values should not change as well */
> +    vmsr->msr_unit  = vmsr_read_msr(MSR_RAPL_POWER_UNIT, 0, pid,
> +                                    s->msr_energy.socket_path);
> +    vmsr->msr_limit = vmsr_read_msr(MSR_PKG_POWER_LIMIT, 0, pid,
> +                                    s->msr_energy.socket_path);
> +    vmsr->msr_info  = vmsr_read_msr(MSR_PKG_POWER_INFO, 0, pid,
> +                                    s->msr_energy.socket_path);
> +
> +    /* Allocate memory for each package energy status */
> +    pkg_stat = (package_energy_stat *)
> +        g_new0(package_energy_stat, maxpkgs);

Return value of g_new0 shouldn't need casting.

> +
> +    /* Pre-allocate memory for thread stats */
> +    thd_stat = g_new0(thread_stat, 1);
> +
> +    /* Pre-allocate memory for holding Virtual Package Energy counter */
> +    vpkgs_energy_stat = g_new0(unsigned int, vmsr->sockets);
> +
> +    /*
> +     * Max numbers of ticks per package
> +     * time in second * number of ticks/second * Number of cores / package
> +     * ex: for 100 ticks/second/CPU, 12 CPUs per Package gives 1200 ticks max
> +     */
> +    maxticks = (MSR_ENERGY_THREAD_SLEEP_US / 1000000)
> +                    * sysconf(_SC_CLK_TCK) * maxcpus;
> +
> +    while (true) {
> +        /* Get all qemu threads id */
> +        thread_ids = vmsr_get_thread_ids(pid, &num_threads);
> +
> +        if (thread_ids == NULL) {
> +            goto clean;
> +        }
> +
> +        if (tmp_num_threads < num_threads) {
> +            thd_stat = g_renew(thread_stat, thd_stat, num_threads);
> +        }
> +
> +        tmp_num_threads = num_threads;
> +
> +        /* Populate all the thread stats */
> +        for (int i = 0; i < num_threads; i++) {
> +            thd_stat[i].utime = g_new0(unsigned long long, 2);
> +            thd_stat[i].stime = g_new0(unsigned long long, 2);
> +            thd_stat[i].thread_id = thread_ids[i];
> +            vmsr_read_thread_stat(&thd_stat[i], pid, 0);
> +            thd_stat[i].numa_node_id = numa_node_of_cpu(thd_stat[i].cpu_id);
> +        }
> +
> +        /* Retrieve all packages power plane energy counter */
> +        for (int i = 0; i <= maxpkgs; i++) {
> +            for (int j = 0; j < num_threads; j++) {
> +                /*
> +                 * Use the first thread we found that ran on the CPU
> +                 * of the package to read the packages energy counter
> +                 */

This says we're using a thread ID

> +                if (thd_stat[j].numa_node_id == i) {
> +                    pkg_stat[i].e_start =
> +                    vmsr_read_msr(MSR_PKG_ENERGY_STATUS, i, pid,

but here we're using a pid ID, which is the thread ID of the initial
thread.

> +                                  s->msr_energy.socket_path);
> +                    break;
> +                }
> +            }
> +        }

This API design for vmsr_read_msr() is incredibly inefficient.
We're making (maxpkgs * num_threads) calls to vmsr_read_msr(),
and every one of those is opening and closing the socket.

Why isn't QEMU opening the socket once and then sending all
the requests over the same socket ?

> +
> +        /* Sleep a short period while the other threads are working */
> +        usleep(MSR_ENERGY_THREAD_SLEEP_US);
> +
> +        /*
> +         * Retrieve all packages power plane energy counter
> +         * Calculate the delta of all packages
> +         */
> +        for (int i = 0; i <= maxpkgs; i++) {
> +            for (int j = 0; j < num_threads; j++) {
> +                /*
> +                 * Use the first thread we found that ran on the CPU
> +                 * of the package to read the packages energy counter
> +                 */
> +                if (thd_stat[j].numa_node_id == i) {
> +                    pkg_stat[i].e_end =
> +                       vmsr_read_msr(MSR_PKG_ENERGY_STATUS,
> +                                     thd_stat[j].cpu_id,

This is passing the 'cpu_id' field, but in the previous call to
vmsr_read_msr 20 lines earlier, we seemd to pass a value that
was the NUMA node id. 

> +                                     thd_stat[j].thread_id,

This is now passing a thread_id, while previously  we passed the
PID id.

> +                                     s->msr_energy.socket_path);
> +                    /*
> +                     * Prevent the case we have migrate the VM
> +                     * during the sleep period or any other cases
> +                     * were energy counter might be lower after
> +                     * the sleep.
> +                     */
> +                    if (pkg_stat[i].e_end > pkg_stat[i].e_start) {
> +                        pkg_stat[i].e_delta =
> +                            pkg_stat[i].e_end - pkg_stat[i].e_start;
> +                    } else {
> +                        pkg_stat[i].e_delta = 0;
> +                    }
> +                    break;
> +                }
> +            }
> +        }
> +
> +        /* Delta of ticks spend by each thread between the sample */
> +        for (int i = 0; i < num_threads; i++) {
> +            if (vmsr_read_thread_stat(&thd_stat[i], pid, 1) != 0) {
> +                /*
> +                 * We don't count the dead thread
> +                 * i.e threads that existed before the sleep
> +                 * and not anymore
> +                 */
> +                thd_stat[i].delta_ticks = 0;
> +            } else {
> +                vmsr_delta_ticks(thd_stat, i);
> +            }
> +        }
> +
> +        /*
> +         * Identify the vCPU threads
> +         * Calculate the Number of vCPU per package
> +         */
> +        CPU_FOREACH(cpu) {
> +            for (int i = 0; i < num_threads; i++) {
> +                if (cpu->thread_id == thd_stat[i].thread_id) {
> +                    thd_stat[i].is_vcpu = true;
> +                    thd_stat[i].vcpu_id = cpu->cpu_index;
> +                    pkg_stat[thd_stat[i].numa_node_id].nb_vcpu++;
> +                    thd_stat[i].acpi_id = kvm_arch_vcpu_id(cpu);
> +                    break;
> +                }
> +            }
> +        }
> +
> +        /* Retrieve the virtual package number of each vCPU */
> +        for (int i = 0; i < vmsr->x86_cpu_list->len; i++) {
> +            for (int j = 0; j < num_threads; j++) {
> +                if ((thd_stat[j].acpi_id == vmsr->x86_cpu_list->cpus[i].arch_id)
> +                    && (thd_stat[j].is_vcpu == true)) {
> +                    x86_topo_ids_from_apicid(thd_stat[j].acpi_id,
> +                        &vmsr->topo_info, &topo_ids);
> +                    thd_stat[j].vpkg = topo_ids.pkg_id;
> +                }
> +            }
> +        }
> +
> +        /* Calculate the total energy of all non-vCPU thread */
> +        for (int i = 0; i < num_threads; i++) {
> +            double temp;
> +            if ((thd_stat[i].is_vcpu != true) &&
> +                (thd_stat[i].delta_ticks > 0)) {
> +                temp = vmsr_get_ratio(pkg_stat, thd_stat, maxticks, i);
> +                pkg_stat[thd_stat[i].numa_node_id].e_ratio
> +                    += (uint64_t)lround(temp);
> +            }
> +        }
> +
> +        /* Calculate the ratio per non-vCPU thread of each package */
> +        for (int i = 0; i <= maxpkgs; i++) {
> +            if (pkg_stat[i].nb_vcpu > 0) {
> +                pkg_stat[i].e_ratio = pkg_stat[i].e_ratio / pkg_stat[i].nb_vcpu;
> +            }
> +        }
> +
> +        /*
> +         * Calculate the energy for each Package:
> +         * Energy Package = sum of each vCPU energy that belongs to the package
> +         */
> +        for (int i = 0; i < num_threads; i++) {
> +            double temp;
> +
> +            if ((thd_stat[i].is_vcpu == true) && \
> +                    (thd_stat[i].delta_ticks > 0)) {
> +                temp = vmsr_get_ratio(pkg_stat, thd_stat, maxticks, i);
> +
> +                vpkgs_energy_stat[thd_stat[i].vpkg] += (uint64_t)lround(temp);
> +                vpkgs_energy_stat[thd_stat[i].vpkg] +=
> +                    pkg_stat[thd_stat[i].numa_node_id].e_ratio;
> +            }
> +        }
> +
> +        /*
> +         * Finally populate the vmsr register of each vCPU with the total
> +         * package value to emulate the real hardware where each CPU return the
> +         * value of the package it belongs.
> +         */
> +        for (int i = 0; i < num_threads; i++) {
> +            if ((thd_stat[i].is_vcpu == true) && \
> +                    (thd_stat[i].delta_ticks > 0)) {
> +                vmsr->msr_value[thd_stat[i].vcpu_id] = \
> +                                            vpkgs_energy_stat[thd_stat[i].vpkg];
> +          }
> +        }
> +
> +        /* Zero out the memory */
> +        for (int i = 0; i < num_threads; i++) {
> +            memset(thd_stat[i].utime, 0, 2 * sizeof(unsigned long long));
> +            memset(thd_stat[i].stime, 0, 2 * sizeof(unsigned long long));
> +        }
> +        memset(thd_stat, 0, num_threads * sizeof(thread_stat));
> +        memset(thread_ids, 0, sizeof(pid_t));
> +    }
> +
> +clean:
> +    rcu_unregister_thread();
> +    return NULL;
> +}
> +
> +static int kvm_msr_energy_thread_init(KVMState *s, MachineState *ms)
> +{
> +    struct KVMMsrEnergy *r = &s->msr_energy;
> +
> +    vmsr_init_topo_info(&r->topo_info, ms);
> +
> +    /* Retrieve the number of vCPU */
> +    r->cpus = ms->smp.cpus;
> +
> +    /* Retrieve the number of sockets */
> +    r->sockets = ms->smp.sockets;
> +
> +    /* Allocate register memory (MSR_PKG_STATUS) for each vCPU */
> +    r->msr_value = g_new0(uint64_t, r->cpus);
> +
> +    /* Retrieve the CPUArchIDlist */
> +    r->x86_cpu_list = x86_possible_cpu_arch_ids(ms);
> +
> +    qemu_thread_create(&r->msr_thr, "kvm-msr",
> +                       kvm_msr_energy_thread,
> +                       s, QEMU_THREAD_JOINABLE);
> +
> +    return 0;
> +}
> +
>  int kvm_arch_get_default_type(MachineState *ms)
>  {
>      return 0;
> @@ -2711,6 +3019,46 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>                           strerror(-ret));
>              exit(1);
>          }
> +
> +        if (s->msr_energy.enable == true) {
> +
> +            r = kvm_filter_msr(s, MSR_RAPL_POWER_UNIT,
> +                               kvm_rdmsr_rapl_power_unit, NULL);
> +            if (!r) {
> +                error_report("Could not install MSR_RAPL_POWER_UNIT \
> +                                handler: %s",
> +                             strerror(-ret));
> +                exit(1);
> +            }
> +
> +            r = kvm_filter_msr(s, MSR_PKG_POWER_LIMIT,
> +                               kvm_rdmsr_pkg_power_limit, NULL);
> +            if (!r) {
> +                error_report("Could not install MSR_PKG_POWER_LIMIT \
> +                                handler: %s",
> +                             strerror(-ret));
> +                exit(1);
> +            }
> +
> +            r = kvm_filter_msr(s, MSR_PKG_POWER_INFO,
> +                               kvm_rdmsr_pkg_power_info, NULL);
> +            if (!r) {
> +                error_report("Could not install MSR_PKG_POWER_INFO \
> +                                handler: %s",
> +                             strerror(-ret));
> +                exit(1);
> +            }
> +            r = kvm_filter_msr(s, MSR_PKG_ENERGY_STATUS,
> +                               kvm_rdmsr_pkg_energy_status, NULL);
> +            if (!r) {
> +                error_report("Could not install MSR_PKG_ENERGY_STATUS \
> +                                handler: %s",
> +                             strerror(-ret));
> +                exit(1);
> +            } else {
> +                kvm_msr_energy_thread_init(s, ms);
> +            }
> +        }
>      }
>  
>      return 0;
> diff --git a/target/i386/kvm/meson.build b/target/i386/kvm/meson.build
> index 84d9143e6029..16010638df69 100644
> --- a/target/i386/kvm/meson.build
> +++ b/target/i386/kvm/meson.build
> @@ -3,6 +3,7 @@ i386_kvm_ss = ss.source_set()
>  i386_kvm_ss.add(files(
>    'kvm.c',
>    'kvm-cpu.c',
> +  'vmsr_energy.c',
>  ))
>  
>  i386_kvm_ss.add(when: 'CONFIG_XEN_EMU', if_true: files('xen-emu.c'))
> diff --git a/target/i386/kvm/vmsr_energy.c b/target/i386/kvm/vmsr_energy.c
> new file mode 100644
> index 000000000000..27588630efa4
> --- /dev/null
> +++ b/target/i386/kvm/vmsr_energy.c
> @@ -0,0 +1,295 @@
> +/*
> + * QEMU KVM support -- x86 virtual energy-related MSR.
> + *
> + * Copyright 2023 Red Hat, Inc. 2023
> + *
> + *  Author:
> + *      Anthony Harivel <aharivel@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "vmsr_energy.h"
> +#include "qapi/error.h"
> +#include "io/channel.h"
> +#include "io/channel-socket.h"
> +#include "hw/boards.h"
> +
> +#define MAX_PATH_LEN 256
> +#define MAX_LINE_LEN 500
> +
> +static char *compute_default_paths(void)
> +{
> +    g_autofree char *state = qemu_get_local_state_dir();
> +
> +    return g_build_filename(state, "run", "qemu-vmsr-helper.sock", NULL);
> +}
> +
> +static int vmsr_helper_socket_read(QIOChannel *ioc,
> +                                  void *buf, int sz, Error **errp)
> +{
> +    ssize_t r = qio_channel_read_all(ioc, buf, sz, errp);
> +
> +    if (r < 0) {
> +        object_unref(OBJECT(ioc));
> +        ioc = NULL;
> +        return -EINVAL;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vmsr_helper_socket_write(QIOChannel *ioc,
> +                                   int fd,
> +                                   const void *buf, int sz, Error **errp)
> +{
> +    size_t nfds = (fd != -1);
> +    while (sz > 0) {
> +        struct iovec iov;
> +        ssize_t n_written;
> +
> +        iov.iov_base = (void *)buf;
> +        iov.iov_len = sz;
> +        n_written = qio_channel_writev_full(QIO_CHANNEL(ioc), &iov, 1,
> +                                            nfds ? &fd : NULL, nfds, 0, errp);
> +
> +        if (n_written <= 0) {
> +            assert(n_written != QIO_CHANNEL_ERR_BLOCK);
> +            object_unref(OBJECT(ioc));
> +            ioc = NULL;
> +            return n_written < 0 ? -EINVAL : 0;
> +        }
> +
> +        nfds = 0;
> +        buf += n_written;
> +        sz -= n_written;
> +    }
> +
> +    return 0;
> +}

THis method seems pointless. It is being called with fd == -1.
The caller should just directly call qio_channel_write_all
which will write all data.

> +
> +uint64_t vmsr_read_msr(uint32_t reg, unsigned int cpu_id, uint32_t tid,
> +                       const char *path)
> +{
> +    uint64_t data = 0;
> +    char *socket_path = NULL;
> +    unsigned int buffer[3];
> +
> +    if (path == NULL) {
> +        socket_path = compute_default_paths();
> +    } else {
> +        socket_path = g_strdup(path);
> +    }
> +
> +    SocketAddress saddr = {
> +        .type = SOCKET_ADDRESS_TYPE_UNIX,
> +        .u.q_unix.path = socket_path
> +    };
> +    QIOChannelSocket *sioc = qio_channel_socket_new();
> +    Error *local_err = NULL;
> +
> +    int r;
> +
> +    qio_channel_set_name(QIO_CHANNEL(sioc), "vmsr-helper");
> +    qio_channel_socket_connect_sync(sioc,
> +                                    &saddr,
> +                                    &local_err);
> +    g_free(socket_path);
> +    if (local_err) {
> +        goto out_close;
> +    }
> +
> +    /*
> +     * Send the required arguments:
> +     * 1. RAPL MSR register to read
> +     * 2. On which CPU ID
> +     * 3. From which vCPU (Thread ID)
> +     */
> +    buffer[0] = reg;
> +    buffer[1] = cpu_id;
> +    buffer[2] = tid;

IMHO it is nicer to declare a packed struct for the fields
being sent, and share its decl between the client and server
code.

> +
> +    r = vmsr_helper_socket_write(QIO_CHANNEL(sioc),
> +                                 -1,
> +                                 &buffer, sizeof(buffer),
> +                                 &local_err);
> +    if (r < 0) {
> +        goto out_close;
> +    }
> +
> +    r = vmsr_helper_socket_read(QIO_CHANNEL(sioc),
> +                                &data, sizeof(data),
> +                                &local_err);
> +    if (r < 0) {
> +        data = 0;
> +        goto out_close;
> +    }
> +
> +out_close:
> +    /* Close socket. */
> +    qio_channel_close(QIO_CHANNEL(sioc), NULL);
> +    object_unref(OBJECT(sioc));
> +    return data;
> +}
> +
> +/* Retrieve the max number of physical CPU on the package */
> +unsigned int vmsr_get_maxcpus(unsigned int package_num)
> +{
> +    int k, ncpus;
> +    unsigned int maxcpus;
> +    struct bitmask *cpus;
> +
> +    cpus = numa_allocate_cpumask();
> +    ncpus = cpus->size;
> +
> +    if (numa_node_to_cpus(package_num, cpus) < 0) {
> +        return 0;
> +    }
> +
> +    maxcpus = 0;
> +    for (k = 0; k < ncpus; k++) {
> +        if (numa_bitmask_isbitset(cpus, k)) {
> +            maxcpus++;
> +        }
> +    }
> +
> +    return maxcpus;
> +}
> +
> +/* Retrieve the maximum number of physical packages */
> +unsigned int vmsr_get_max_physical_package(unsigned int max_cpus)
> +{
> +    unsigned int packageCount = 0;
> +    const char *dir = "/sys/devices/system/cpu/";
> +    int *uniquePackages;

Mark this as g_autofree() to eliminuate the multiple calls
go g_free()in each exit path

> +
> +    char *filePath;
> +    FILE *file;
> +
> +    uniquePackages = g_new0(int, max_cpus);
> +
> +    for (int i = 0; i < max_cpus; i++) {
> +        filePath = g_build_filename(dir, g_strdup_printf("cpu%d", i),
> +                                     "topology/physical_package_id", NULL);

Leaking the g_strdup_printf result.

Also filePath is allocated on every iteration, but only freed
outside the loop, so another leak. Declare filePath inside
the loop and mark it g_autofree

> +
> +        file = fopen(filePath, "r");
> +
> +        if (file == NULL) {
> +            perror("Error opening file");
> +            g_free(filePath);
> +            g_free(uniquePackages);
> +            return 0;
> +        }
> +
> +        char packageId[10];
> +        if (fgets(packageId, sizeof(packageId), file) == NULL) {
> +            packageCount = 0;
> +        }
> +
> +        fclose(file);
> +
> +        int currentPackageId = atoi(packageId);
> +
> +        bool isUnique = true;
> +        for (int j = 0; j < packageCount; j++) {
> +            if (uniquePackages[j] == currentPackageId) {
> +                isUnique = false;
> +                break;
> +            }
> +        }
> +
> +        if (isUnique) {
> +            uniquePackages[packageCount] = currentPackageId;
> +            packageCount++;
> +
> +            if (packageCount >= max_cpus) {
> +                break;
> +            }
> +        }
> +    }
> +
> +    g_free(filePath);
> +    g_free(uniquePackages);
> +    return (packageCount == 0) ? 1 : packageCount;
> +}
> +int vmsr_read_thread_stat(struct thread_stat *thread, int pid, int index)
> +{
> +    char *path;
> +    path = g_new0(char, MAX_PATH_LEN);

This pointer is overwritten on the very next line, so is leaked

> +
> +    path = g_build_filename(g_strdup_printf("/proc/%u/task/%d/stat", pid, \
> +            thread->thread_id), NULL);

This result is never freed, and the intermediate g_strdup_printf() is also
not freed. Using g_build_filename is pointless if you're going to use
g_strdup_printf() to buld the entire filename in one go.

> +
> +    FILE *file = fopen(path, "r");
> +    if (file == NULL) {
> +        return -1;
> +    }
> +
> +    if (fscanf(file, "%*d (%*[^)]) %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u"
> +        " %llu %llu %*d %*d %*d %*d %*d %*d %*u %*u %*d %*u %*u"
> +        " %*u %*u %*u %*u %*u %*u %*u %*u %*u %*d %*u %*u %u",
> +           &thread->utime[index], &thread->stime[index], &thread->cpu_id) != 3)
> +        return -1;
> +
> +    fclose(file);
> +    return 0;
> +}
> +
> +/* Read QEMU stat task folder to retrieve all QEMU threads ID */
> +pid_t *vmsr_get_thread_ids(pid_t pid, unsigned int *num_threads)
> +{
> +    char *path = g_build_filename("/proc", g_strdup_printf("%d/task", pid), NULL);

Leaking the g_strdup_printf result

> +
> +    DIR *dir = opendir(path);
> +    if (dir == NULL) {
> +        perror("opendir");
> +        g_free(path);

Mark 'path' as g_autofree so g_free() isn't needed in every exit-path

> +        return NULL;
> +    }
> +
> +    pid_t *thread_ids = NULL;
> +    unsigned int thread_count = 0;
> +
> +    struct dirent *ent;
> +    while ((ent = readdir(dir)) != NULL) {
> +        if (ent->d_name[0] == '.') {
> +            continue;
> +        }
> +        pid_t tid = atoi(ent->d_name);
> +        if (pid != tid) {
> +            thread_ids = g_renew(pid_t, thread_ids, (thread_count + 1));
> +            thread_ids[thread_count] = tid;
> +            thread_count++;
> +        }
> +    }
> +
> +    closedir(dir);
> +
> +    *num_threads = thread_count;
> +    g_free(path);
> +    return thread_ids;
> +}

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu
  2024-01-29 19:29   ` Daniel P. Berrangé
@ 2024-02-20 14:00     ` Anthony Harivel
  2024-02-20 15:00       ` Daniel P. Berrangé
  2024-03-05 14:58     ` Anthony Harivel
  1 sibling, 1 reply; 32+ messages in thread
From: Anthony Harivel @ 2024-02-20 14:00 UTC (permalink / raw)
  To: Daniel P. Berrangé; +Cc: pbonzini, mtosatti, qemu-devel, vchundur

Daniel P. Berrangé, Jan 29, 2024 at 20:29:
> On Thu, Jan 25, 2024 at 08:22:14AM +0100, Anthony Harivel wrote:
> > diff --git a/docs/specs/rapl-msr.rst b/docs/specs/rapl-msr.rst
> > new file mode 100644
> > index 000000000000..04d27c198fc0
> > --- /dev/null
> > +++ b/docs/specs/rapl-msr.rst
> > @@ -0,0 +1,133 @@
> > +================
> > +RAPL MSR support
> > +================
>
> > +
> > +Current Limitations
> > +-------------------
> > +
> > +- Works only on Intel host CPUs because AMD CPUs are using different MSR
> > +  addresses.
>
> The privileged helper program is validating an allow list of MSRs.
>
> If those MSRs are only correct on Intel hosts, then the validation
> is incomplete, and it could be allowing unprivileged processes on
> AMD hosts to access forbidden MSRS whose address happen to clash
> with the Intel RAPL MSRs.
>
> IOW, the privileged helper needs to call cpuid() and validate that
> the current host vendor is Intel.
>
> I suspect we also need a feature check of some kind to validate
> that the intel processor supports this features, since old ones
> definitely didn't, and we shouldn't assume all future ones will
> either.
>

To validate that the processor supports the RAPL feature I propose
to check this on the Host:

$ cat /sys/class/powercap/intel-rapl/enabled
1


The only down side is that INTEL RAPL drivers needs to be
mounted then. We don't need it because we directly read the MSRs.

Regards,
Anthony



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu
  2024-02-20 14:00     ` Anthony Harivel
@ 2024-02-20 15:00       ` Daniel P. Berrangé
  0 siblings, 0 replies; 32+ messages in thread
From: Daniel P. Berrangé @ 2024-02-20 15:00 UTC (permalink / raw)
  To: Anthony Harivel; +Cc: pbonzini, mtosatti, qemu-devel, vchundur

On Tue, Feb 20, 2024 at 03:00:56PM +0100, Anthony Harivel wrote:
> Daniel P. Berrangé, Jan 29, 2024 at 20:29:
> > On Thu, Jan 25, 2024 at 08:22:14AM +0100, Anthony Harivel wrote:
> > > diff --git a/docs/specs/rapl-msr.rst b/docs/specs/rapl-msr.rst
> > > new file mode 100644
> > > index 000000000000..04d27c198fc0
> > > --- /dev/null
> > > +++ b/docs/specs/rapl-msr.rst
> > > @@ -0,0 +1,133 @@
> > > +================
> > > +RAPL MSR support
> > > +================
> >
> > > +
> > > +Current Limitations
> > > +-------------------
> > > +
> > > +- Works only on Intel host CPUs because AMD CPUs are using different MSR
> > > +  addresses.
> >
> > The privileged helper program is validating an allow list of MSRs.
> >
> > If those MSRs are only correct on Intel hosts, then the validation
> > is incomplete, and it could be allowing unprivileged processes on
> > AMD hosts to access forbidden MSRS whose address happen to clash
> > with the Intel RAPL MSRs.
> >
> > IOW, the privileged helper needs to call cpuid() and validate that
> > the current host vendor is Intel.
> >
> > I suspect we also need a feature check of some kind to validate
> > that the intel processor supports this features, since old ones
> > definitely didn't, and we shouldn't assume all future ones will
> > either.
> >
> 
> To validate that the processor supports the RAPL feature I propose
> to check this on the Host:
> 
> $ cat /sys/class/powercap/intel-rapl/enabled
> 1
> 
> 
> The only down side is that INTEL RAPL drivers needs to be
> mounted then. We don't need it because we directly read the MSRs.

Awkward. I've looked around in the kernel drivers and didn't uncover
any better option than this.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu
  2024-01-29 19:29   ` Daniel P. Berrangé
  2024-02-20 14:00     ` Anthony Harivel
@ 2024-03-05 14:58     ` Anthony Harivel
  1 sibling, 0 replies; 32+ messages in thread
From: Anthony Harivel @ 2024-03-05 14:58 UTC (permalink / raw)
  To: Daniel P. Berrangé; +Cc: pbonzini, mtosatti, qemu-devel, vchundur

Hi Daniel,

> > +
> > +        /* Retrieve all packages power plane energy counter */
> > +        for (int i = 0; i <= maxpkgs; i++) {
> > +            for (int j = 0; j < num_threads; j++) {
> > +                /*
> > +                 * Use the first thread we found that ran on the CPU
> > +                 * of the package to read the packages energy counter
> > +                 */
>
> This says we're using a thread ID
>
> > +                if (thd_stat[j].numa_node_id == i) {
> > +                    pkg_stat[i].e_start =
> > +                    vmsr_read_msr(MSR_PKG_ENERGY_STATUS, i, pid,
>
> but here we're using a pid ID, which is the thread ID of the initial
> thread.
>
> > +                                  s->msr_energy.socket_path);
> > +                    break;
> > +                }
> > +            }
> > +        }
>
> This API design for vmsr_read_msr() is incredibly inefficient.
> We're making (maxpkgs * num_threads) calls to vmsr_read_msr(),
> and every one of those is opening and closing the socket.
>
> Why isn't QEMU opening the socket once and then sending all
> the requests over the same socket ?
>

The usage of pid here is a mistake, thanks for pointing this out.

However, I'm more sceptical about the fact that the loop is inefficient. 
The confusion could definitely be because of the poor variable naming, 
and I apologize about that.
Let me try to explain what it's supposed to do:
Imagine we are running on machine that has i packages. QEMU has 
j threads running on whichever packages. We need to get the current 
packages energy of each packages that are used by the QEMU threads. 
(could be all i packages, only 1, 2.. we don't know what we need yet) 
So it loops first on the packages "0", and look if any thread has run 
on this packages. 
If no, test the next thread. 
if yes, we need the value, we call the vmsr_read_msr() then break and 
now loop for the next package, i.e package "1". And this until all 
packages has been tested.

So in the end, we 'only' have maximum "maxpkgs" calls of vmsr_read_msr().

Hope that's ok and that clear up the confusion!

Regards,
Anthony

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu
  2024-01-25  7:22 ` [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu Anthony Harivel
  2024-01-29 19:29   ` Daniel P. Berrangé
@ 2024-01-30  9:13   ` Daniel P. Berrangé
  2024-03-04 14:41     ` Anthony Harivel
  2024-01-30  9:39   ` Daniel P. Berrangé
  2 siblings, 1 reply; 32+ messages in thread
From: Daniel P. Berrangé @ 2024-01-30  9:13 UTC (permalink / raw)
  To: Anthony Harivel; +Cc: pbonzini, mtosatti, qemu-devel, vchundur

On Thu, Jan 25, 2024 at 08:22:14AM +0100, Anthony Harivel wrote:
> Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL
> interface (Running Average Power Limit) for advertising the accumulated
> energy consumption of various power domains (e.g. CPU packages, DRAM,
> etc.).
> 
> The consumption is reported via MSRs (model specific registers) like
> MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are
> 64 bits registers that represent the accumulated energy consumption in
> micro Joules. They are updated by microcode every ~1ms.
> 
> For now, KVM always returns 0 when the guest requests the value of
> these MSRs. Use the KVM MSR filtering mechanism to allow QEMU handle
> these MSRs dynamically in userspace.
> 
> To limit the amount of system calls for every MSR call, create a new
> thread in QEMU that updates the "virtual" MSR values asynchronously.
> 
> Each vCPU has its own vMSR to reflect the independence of vCPUs. The
> thread updates the vMSR values with the ratio of energy consumed of
> the whole physical CPU package the vCPU thread runs on and the
> thread's utime and stime values.
> 
> All other non-vCPU threads are also taken into account. Their energy
> consumption is evenly distributed among all vCPUs threads running on
> the same physical CPU package.
> 
> To overcome the problem that reading the RAPL MSR requires priviliged
> access, a socket communication between QEMU and the qemu-vmsr-helper is
> mandatory. You can specified the socket path in the parameter.
> 
> This feature is activated with -accel kvm,rapl=true,path=/path/sock.sock
> 
> Actual limitation:
> - Works only on Intel host CPU because AMD CPUs are using different MSR
>   adresses.
> 
> - Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at
>   the moment.
> 
> Signed-off-by: Anthony Harivel <aharivel@redhat.com>
> ---
>  accel/kvm/kvm-all.c           |  27 +++
>  docs/specs/index.rst          |   1 +
>  docs/specs/rapl-msr.rst       | 133 +++++++++++++
>  include/sysemu/kvm_int.h      |  17 ++
>  target/i386/cpu.h             |   8 +
>  target/i386/kvm/kvm.c         | 348 ++++++++++++++++++++++++++++++++++
>  target/i386/kvm/meson.build   |   1 +
>  target/i386/kvm/vmsr_energy.c | 295 ++++++++++++++++++++++++++++
>  target/i386/kvm/vmsr_energy.h |  87 +++++++++
>  9 files changed, 917 insertions(+)
>  create mode 100644 docs/specs/rapl-msr.rst
>  create mode 100644 target/i386/kvm/vmsr_energy.c
>  create mode 100644 target/i386/kvm/vmsr_energy.h
> 

> diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
> index 882e37e12c5b..ee2fe8817833 100644
> --- a/include/sysemu/kvm_int.h
> +++ b/include/sysemu/kvm_int.h
> @@ -14,6 +14,8 @@
>  #include "qemu/accel.h"
>  #include "qemu/queue.h"
>  #include "sysemu/kvm.h"
> +#include "hw/boards.h"
> +#include "hw/i386/topology.h"
>  
>  typedef struct KVMSlot
>  {
> @@ -48,6 +50,20 @@ typedef struct KVMMemoryListener {
>  
>  #define KVM_MSI_HASHTAB_SIZE    256
>  
> +struct KVMMsrEnergy {
> +    bool enable;
> +    char *socket_path;
> +    QemuThread msr_thr;
> +    unsigned int cpus;
> +    unsigned int sockets;
> +    X86CPUTopoInfo topo_info;
> +    const CPUArchIdList *cpu_list;
> +    uint64_t *msr_value;
> +    uint64_t msr_unit;
> +    uint64_t msr_limit;
> +    uint64_t msr_info;
> +};
> +
>  enum KVMDirtyRingReaperState {
>      KVM_DIRTY_RING_REAPER_NONE = 0,
>      /* The reaper is sleeping */
> @@ -114,6 +130,7 @@ struct KVMState
>      bool kvm_dirty_ring_with_bitmap;
>      uint64_t kvm_eager_split_size;  /* Eager Page Splitting chunk size */
>      struct KVMDirtyRingReaper reaper;
> +    struct KVMMsrEnergy msr_energy;
>      NotifyVmexitOption notify_vmexit;
>      uint32_t notify_window;
>      uint32_t xen_version;

> @@ -2509,6 +2558,265 @@ static void register_smram_listener(Notifier *n, void *unused)
>                                   &smram_address_space, 1, "kvm-smram");
>  }
>  
> +static void *kvm_msr_energy_thread(void *data)
> +{
> +    KVMState *s = data;
> +    struct KVMMsrEnergy *vmsr = &s->msr_energy;
> +
> +    g_autofree package_energy_stat *pkg_stat = NULL;
> +    g_autofree thread_stat *thd_stat = NULL;
> +    g_autofree pid_t *thread_ids = NULL;
> +    g_autofree CPUState *cpu = NULL;
> +    unsigned int maxpkgs, maxcpus, maxticks;
> +    g_autofree unsigned int *vpkgs_energy_stat = NULL;
> +    unsigned int num_threads = 0;
> +    unsigned int tmp_num_threads = 0;
> +    pid_t pid;
> +
> +    X86CPUTopoIDs topo_ids;
> +
> +
> +    rcu_register_thread();
> +
> +    /* Get QEMU PID*/
> +    pid = getpid();
> +
> +    /* Nb of CPUS per packages */
> +    maxcpus = vmsr_get_maxcpus(0);
> +
> +    /* Nb of Physical Packages on the system */
> +    maxpkgs = vmsr_get_max_physical_package(maxcpus);
> +
> +    /* Those MSR values should not change as well */
> +    vmsr->msr_unit  = vmsr_read_msr(MSR_RAPL_POWER_UNIT, 0, pid,
> +                                    s->msr_energy.socket_path);
> +    vmsr->msr_limit = vmsr_read_msr(MSR_PKG_POWER_LIMIT, 0, pid,
> +                                    s->msr_energy.socket_path);
> +    vmsr->msr_info  = vmsr_read_msr(MSR_PKG_POWER_INFO, 0, pid,
> +                                    s->msr_energy.socket_path);
> +
> +    /* Allocate memory for each package energy status */
> +    pkg_stat = (package_energy_stat *)
> +        g_new0(package_energy_stat, maxpkgs);
> +
> +    /* Pre-allocate memory for thread stats */
> +    thd_stat = g_new0(thread_stat, 1);
> +
> +    /* Pre-allocate memory for holding Virtual Package Energy counter */
> +    vpkgs_energy_stat = g_new0(unsigned int, vmsr->sockets);
> +
> +    /*
> +     * Max numbers of ticks per package
> +     * time in second * number of ticks/second * Number of cores / package
> +     * ex: for 100 ticks/second/CPU, 12 CPUs per Package gives 1200 ticks max
> +     */
> +    maxticks = (MSR_ENERGY_THREAD_SLEEP_US / 1000000)
> +                    * sysconf(_SC_CLK_TCK) * maxcpus;
> +
> +    while (true) {
> +        /* Get all qemu threads id */
> +        thread_ids = vmsr_get_thread_ids(pid, &num_threads);

This is allolcating thread_ids

> +
> +        if (thread_ids == NULL) {
> +            goto clean;
> +        }
> +
> +        if (tmp_num_threads < num_threads) {
> +            thd_stat = g_renew(thread_stat, thd_stat, num_threads);
> +        }
> +
> +        tmp_num_threads = num_threads;
> +
> +        /* Populate all the thread stats */
> +        for (int i = 0; i < num_threads; i++) {
> +            thd_stat[i].utime = g_new0(unsigned long long, 2);
> +            thd_stat[i].stime = g_new0(unsigned long long, 2);
> +            thd_stat[i].thread_id = thread_ids[i];
> +            vmsr_read_thread_stat(&thd_stat[i], pid, 0);
> +            thd_stat[i].numa_node_id = numa_node_of_cpu(thd_stat[i].cpu_id);
> +        }

This loop is allocating memory for utime & stime..

> +
> +        /* Zero out the memory */
> +        for (int i = 0; i < num_threads; i++) {
> +            memset(thd_stat[i].utime, 0, 2 * sizeof(unsigned long long));
> +            memset(thd_stat[i].stime, 0, 2 * sizeof(unsigned long long));
> +        }
> +        memset(thd_stat, 0, num_threads * sizeof(thread_stat));

..and here we're just wiping the thd_stat pointer, so this is surey a
memory leak of the stime & utime fields.


> +        memset(thread_ids, 0, sizeof(pid_t));

Nothing is free'ing thread_ids which was allocated earlier in the
loop.

> +    }
> +
> +clean:
> +    rcu_unregister_thread();
> +    return NULL;
> +}
> +
> +static int kvm_msr_energy_thread_init(KVMState *s, MachineState *ms)
> +{
> +    struct KVMMsrEnergy *r = &s->msr_energy;
> +
> +    vmsr_init_topo_info(&r->topo_info, ms);
> +
> +    /* Retrieve the number of vCPU */
> +    r->cpus = ms->smp.cpus;
> +
> +    /* Retrieve the number of sockets */
> +    r->sockets = ms->smp.sockets;
> +
> +    /* Allocate register memory (MSR_PKG_STATUS) for each vCPU */
> +    r->msr_value = g_new0(uint64_t, r->cpus);
> +
> +    /* Retrieve the CPUArchIDlist */
> +    r->x86_cpu_list = x86_possible_cpu_arch_ids(ms);
> +
> +    qemu_thread_create(&r->msr_thr, "kvm-msr",
> +                       kvm_msr_energy_thread,
> +                       s, QEMU_THREAD_JOINABLE);
> +
> +    return 0;
> +}
> +
>  int kvm_arch_get_default_type(MachineState *ms)
>  {
>      return 0;
> @@ -2711,6 +3019,46 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>                           strerror(-ret));
>              exit(1);
>          }
> +
> +        if (s->msr_energy.enable == true) {

This looks to be where we need to check that both the host CPU
vendor is intel, and the guest CPU vendor is intel, and that
the host CPU has the RAPL feature we're using.

> +
> +            r = kvm_filter_msr(s, MSR_RAPL_POWER_UNIT,
> +                               kvm_rdmsr_rapl_power_unit, NULL);
> +            if (!r) {
> +                error_report("Could not install MSR_RAPL_POWER_UNIT \
> +                                handler: %s",
> +                             strerror(-ret));
> +                exit(1);
> +            }
> +
> +            r = kvm_filter_msr(s, MSR_PKG_POWER_LIMIT,
> +                               kvm_rdmsr_pkg_power_limit, NULL);
> +            if (!r) {
> +                error_report("Could not install MSR_PKG_POWER_LIMIT \
> +                                handler: %s",
> +                             strerror(-ret));
> +                exit(1);
> +            }
> +
> +            r = kvm_filter_msr(s, MSR_PKG_POWER_INFO,
> +                               kvm_rdmsr_pkg_power_info, NULL);
> +            if (!r) {
> +                error_report("Could not install MSR_PKG_POWER_INFO \
> +                                handler: %s",
> +                             strerror(-ret));
> +                exit(1);
> +            }
> +            r = kvm_filter_msr(s, MSR_PKG_ENERGY_STATUS,
> +                               kvm_rdmsr_pkg_energy_status, NULL);
> +            if (!r) {
> +                error_report("Could not install MSR_PKG_ENERGY_STATUS \
> +                                handler: %s",
> +                             strerror(-ret));
> +                exit(1);
> +            } else {
> +                kvm_msr_energy_thread_init(s, ms);
> +            }
> +        }
>      }
>  
>      return 0;


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu
  2024-01-30  9:13   ` Daniel P. Berrangé
@ 2024-03-04 14:41     ` Anthony Harivel
  2024-03-04 14:48       ` Daniel P. Berrangé
  0 siblings, 1 reply; 32+ messages in thread
From: Anthony Harivel @ 2024-03-04 14:41 UTC (permalink / raw)
  To: Daniel P. Berrangé; +Cc: pbonzini, mtosatti, qemu-devel, vchundur


Hi Daniel,

> > +        if (s->msr_energy.enable == true) {
>
> This looks to be where we need to check that both the host CPU
> vendor is intel, and the guest CPU vendor is intel, and that
> the host CPU has the RAPL feature we're using.
>
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Checking for the host cpu and RAPL enable is fine and done. 

But checking for guest CPU is confusing me. 
The RAPL feature is enable only with KVM enable. 
This means "-cpu" can only be "host" or its derivative that essentially 
copy the host CPU definition, no?
That means if we are already checking the host cpu we don't need to do 
anything for the guest, do we ?

Regards,
Anthony



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu
  2024-03-04 14:41     ` Anthony Harivel
@ 2024-03-04 14:48       ` Daniel P. Berrangé
  2024-03-05 13:25         ` Anthony Harivel
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel P. Berrangé @ 2024-03-04 14:48 UTC (permalink / raw)
  To: Anthony Harivel; +Cc: pbonzini, mtosatti, qemu-devel, vchundur

On Mon, Mar 04, 2024 at 03:41:02PM +0100, Anthony Harivel wrote:
> 
> Hi Daniel,
> 
> > > +        if (s->msr_energy.enable == true) {
> >
> > This looks to be where we need to check that both the host CPU
> > vendor is intel, and the guest CPU vendor is intel, and that
> > the host CPU has the RAPL feature we're using.
>
> Checking for the host cpu and RAPL enable is fine and done. 
> 
> But checking for guest CPU is confusing me. 
> The RAPL feature is enable only with KVM enable. 
> This means "-cpu" can only be "host" or its derivative that essentially 
> copy the host CPU definition, no?

KVM can use any named CPU.

> That means if we are already checking the host cpu we don't need to do 
> anything for the guest, do we ?

When I first wrote this I though it would be as simple as checknig a
CPUID feature flag. That appears to not be the case, however, as Linux
is just checking for various CPU models directly. With that in mind
perhaps we should just check of the guest CPU model vendor
== CPUID_VENDOR_INTEL and leave it at that.

eg, create an error if running an AMD CPU such as $QEMU -cpu EPYC

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu
  2024-03-04 14:48       ` Daniel P. Berrangé
@ 2024-03-05 13:25         ` Anthony Harivel
  2024-03-05 13:57           ` Daniel P. Berrangé
  0 siblings, 1 reply; 32+ messages in thread
From: Anthony Harivel @ 2024-03-05 13:25 UTC (permalink / raw)
  To: Daniel P. Berrangé; +Cc: pbonzini, mtosatti, qemu-devel, vchundur

Daniel P. Berrangé, Mar 04, 2024 at 15:48:
> On Mon, Mar 04, 2024 at 03:41:02PM +0100, Anthony Harivel wrote:
> > 
> > Hi Daniel,
> > 
> > > > +        if (s->msr_energy.enable == true) {
> > >
> > > This looks to be where we need to check that both the host CPU
> > > vendor is intel, and the guest CPU vendor is intel, and that
> > > the host CPU has the RAPL feature we're using.
> >
> > Checking for the host cpu and RAPL enable is fine and done. 
> > 
> > But checking for guest CPU is confusing me. 
> > The RAPL feature is enable only with KVM enable. 
> > This means "-cpu" can only be "host" or its derivative that essentially 
> > copy the host CPU definition, no?
>
> KVM can use any named CPU.
>
> > That means if we are already checking the host cpu we don't need to do 
> > anything for the guest, do we ?
>
> When I first wrote this I though it would be as simple as checknig a
> CPUID feature flag. That appears to not be the case, however, as Linux
> is just checking for various CPU models directly. With that in mind
> perhaps we should just check of the guest CPU model vendor
> == CPUID_VENDOR_INTEL and leave it at that.
>
> eg, create an error if running an AMD CPU such as $QEMU -cpu EPYC

The idea looks good to me. Now the hiccups of this solution is that 
I cannot find a way to reach CPUArchState at this level of code (i.e 
kvm_arch_init() ) with only the MachineState or the KVMState. 
I can only reach the topology with x86_possible_cpu_arch_ids().

CPUArchState struct is holding the cpuid_vendor variables where we can 
use IS_INTEL_CPU() for checking.

Maybe you know the trick that I miss ? 

Regards,
Anthony

>
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu
  2024-03-05 13:25         ` Anthony Harivel
@ 2024-03-05 13:57           ` Daniel P. Berrangé
  0 siblings, 0 replies; 32+ messages in thread
From: Daniel P. Berrangé @ 2024-03-05 13:57 UTC (permalink / raw)
  To: Anthony Harivel; +Cc: pbonzini, mtosatti, qemu-devel, vchundur

On Tue, Mar 05, 2024 at 02:25:09PM +0100, Anthony Harivel wrote:
> Daniel P. Berrangé, Mar 04, 2024 at 15:48:
> > On Mon, Mar 04, 2024 at 03:41:02PM +0100, Anthony Harivel wrote:
> > > 
> > > Hi Daniel,
> > > 
> > > > > +        if (s->msr_energy.enable == true) {
> > > >
> > > > This looks to be where we need to check that both the host CPU
> > > > vendor is intel, and the guest CPU vendor is intel, and that
> > > > the host CPU has the RAPL feature we're using.
> > >
> > > Checking for the host cpu and RAPL enable is fine and done. 
> > > 
> > > But checking for guest CPU is confusing me. 
> > > The RAPL feature is enable only with KVM enable. 
> > > This means "-cpu" can only be "host" or its derivative that essentially 
> > > copy the host CPU definition, no?
> >
> > KVM can use any named CPU.
> >
> > > That means if we are already checking the host cpu we don't need to do 
> > > anything for the guest, do we ?
> >
> > When I first wrote this I though it would be as simple as checknig a
> > CPUID feature flag. That appears to not be the case, however, as Linux
> > is just checking for various CPU models directly. With that in mind
> > perhaps we should just check of the guest CPU model vendor
> > == CPUID_VENDOR_INTEL and leave it at that.
> >
> > eg, create an error if running an AMD CPU such as $QEMU -cpu EPYC
> 
> The idea looks good to me. Now the hiccups of this solution is that 
> I cannot find a way to reach CPUArchState at this level of code (i.e 
> kvm_arch_init() ) with only the MachineState or the KVMState. 
> I can only reach the topology with x86_possible_cpu_arch_ids().
> 
> CPUArchState struct is holding the cpuid_vendor variables where we can 
> use IS_INTEL_CPU() for checking.
> 
> Maybe you know the trick that I miss ?

I think perhaps you can do a check in kvm_cpu_realizefn() from
target/i386/kvm/kvm-cpu.c, as you have CPUX86State state which
is what IS_INTEL_CPU wants.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu
  2024-01-25  7:22 ` [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu Anthony Harivel
  2024-01-29 19:29   ` Daniel P. Berrangé
  2024-01-30  9:13   ` Daniel P. Berrangé
@ 2024-01-30  9:39   ` Daniel P. Berrangé
  2024-03-12 11:21     ` Anthony Harivel
  2 siblings, 1 reply; 32+ messages in thread
From: Daniel P. Berrangé @ 2024-01-30  9:39 UTC (permalink / raw)
  To: Anthony Harivel; +Cc: pbonzini, mtosatti, qemu-devel, vchundur

On Thu, Jan 25, 2024 at 08:22:14AM +0100, Anthony Harivel wrote:
> Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL
> interface (Running Average Power Limit) for advertising the accumulated
> energy consumption of various power domains (e.g. CPU packages, DRAM,
> etc.).
> 
> The consumption is reported via MSRs (model specific registers) like
> MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are
> 64 bits registers that represent the accumulated energy consumption in
> micro Joules. They are updated by microcode every ~1ms.
> 
> For now, KVM always returns 0 when the guest requests the value of
> these MSRs. Use the KVM MSR filtering mechanism to allow QEMU handle
> these MSRs dynamically in userspace.
> 
> To limit the amount of system calls for every MSR call, create a new
> thread in QEMU that updates the "virtual" MSR values asynchronously.
> 
> Each vCPU has its own vMSR to reflect the independence of vCPUs. The
> thread updates the vMSR values with the ratio of energy consumed of
> the whole physical CPU package the vCPU thread runs on and the
> thread's utime and stime values.
> 
> All other non-vCPU threads are also taken into account. Their energy
> consumption is evenly distributed among all vCPUs threads running on
> the same physical CPU package.
> 
> To overcome the problem that reading the RAPL MSR requires priviliged
> access, a socket communication between QEMU and the qemu-vmsr-helper is
> mandatory. You can specified the socket path in the parameter.
> 
> This feature is activated with -accel kvm,rapl=true,path=/path/sock.sock
> 
> Actual limitation:
> - Works only on Intel host CPU because AMD CPUs are using different MSR
>   adresses.
> 
> - Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at
>   the moment.
> 
> Signed-off-by: Anthony Harivel <aharivel@redhat.com>
> ---
>  accel/kvm/kvm-all.c           |  27 +++
>  docs/specs/index.rst          |   1 +
>  docs/specs/rapl-msr.rst       | 133 +++++++++++++
>  include/sysemu/kvm_int.h      |  17 ++
>  target/i386/cpu.h             |   8 +
>  target/i386/kvm/kvm.c         | 348 ++++++++++++++++++++++++++++++++++
>  target/i386/kvm/meson.build   |   1 +
>  target/i386/kvm/vmsr_energy.c | 295 ++++++++++++++++++++++++++++
>  target/i386/kvm/vmsr_energy.h |  87 +++++++++
>  9 files changed, 917 insertions(+)
>  create mode 100644 docs/specs/rapl-msr.rst
>  create mode 100644 target/i386/kvm/vmsr_energy.c
>  create mode 100644 target/i386/kvm/vmsr_energy.h
> 

> @@ -2509,6 +2558,265 @@ static void register_smram_listener(Notifier *n, void *unused)
>                                   &smram_address_space, 1, "kvm-smram");
>  }
>  
> +static void *kvm_msr_energy_thread(void *data)
> +{
> +    KVMState *s = data;
> +    struct KVMMsrEnergy *vmsr = &s->msr_energy;
> +
> +    g_autofree package_energy_stat *pkg_stat = NULL;
> +    g_autofree thread_stat *thd_stat = NULL;
> +    g_autofree pid_t *thread_ids = NULL;
> +    g_autofree CPUState *cpu = NULL;
> +    unsigned int maxpkgs, maxcpus, maxticks;
> +    g_autofree unsigned int *vpkgs_energy_stat = NULL;
> +    unsigned int num_threads = 0;
> +    unsigned int tmp_num_threads = 0;
> +    pid_t pid;
> +
> +    X86CPUTopoIDs topo_ids;
> +
> +
> +    rcu_register_thread();
> +
> +    /* Get QEMU PID*/
> +    pid = getpid();
> +
> +    /* Nb of CPUS per packages */
> +    maxcpus = vmsr_get_maxcpus(0);
> +
> +    /* Nb of Physical Packages on the system */
> +    maxpkgs = vmsr_get_max_physical_package(maxcpus);

This function can fail so this needs to be checked & reported.

> +
> +    /* Those MSR values should not change as well */
> +    vmsr->msr_unit  = vmsr_read_msr(MSR_RAPL_POWER_UNIT, 0, pid,
> +                                    s->msr_energy.socket_path);
> +    vmsr->msr_limit = vmsr_read_msr(MSR_PKG_POWER_LIMIT, 0, pid,
> +                                    s->msr_energy.socket_path);
> +    vmsr->msr_info  = vmsr_read_msr(MSR_PKG_POWER_INFO, 0, pid,
> +                                    s->msr_energy.socket_path);

This function can fail for a variety of reasons, most especially if someone
gave an incorrect socket path, or if the daemon is not running. This is not
getting diagnosed, and even if we try to report it here, we're in a background
thread at this point.

I think we need to connect and report errors before even starting this
thread, so that QEMU startup gets aborted upon configuration error.

> +
> +    /* Allocate memory for each package energy status */
> +    pkg_stat = (package_energy_stat *)
> +        g_new0(package_energy_stat, maxpkgs);
> +
> +    /* Pre-allocate memory for thread stats */
> +    thd_stat = g_new0(thread_stat, 1);
> +
> +    /* Pre-allocate memory for holding Virtual Package Energy counter */
> +    vpkgs_energy_stat = g_new0(unsigned int, vmsr->sockets);
> +
> +    /*
> +     * Max numbers of ticks per package
> +     * time in second * number of ticks/second * Number of cores / package
> +     * ex: for 100 ticks/second/CPU, 12 CPUs per Package gives 1200 ticks max
> +     */
> +    maxticks = (MSR_ENERGY_THREAD_SLEEP_US / 1000000)
> +                    * sysconf(_SC_CLK_TCK) * maxcpus;
> +
> +    while (true) {
> +        /* Get all qemu threads id */
> +        thread_ids = vmsr_get_thread_ids(pid, &num_threads);
> +
> +        if (thread_ids == NULL) {
> +            goto clean;
> +        }
> +
> +        if (tmp_num_threads < num_threads) {
> +            thd_stat = g_renew(thread_stat, thd_stat, num_threads);
> +        }
> +
> +        tmp_num_threads = num_threads;
> +
> +        /* Populate all the thread stats */
> +        for (int i = 0; i < num_threads; i++) {
> +            thd_stat[i].utime = g_new0(unsigned long long, 2);
> +            thd_stat[i].stime = g_new0(unsigned long long, 2);
> +            thd_stat[i].thread_id = thread_ids[i];
> +            vmsr_read_thread_stat(&thd_stat[i], pid, 0);

It is non-obvious that the 3rd parameter here is an index into
the utime & stime array. This function would be saner to review
if called as:

            vmsr_read_thread_stat(pid,
	                          thd_stat[i].thread_id,
	                          &thd_stat[i].utime[0],
	                          &thd_stat[i].stime[0],
	                          &thd_stat[i].cpu_id);

so we see what are input parameters and what are output parameters.

Also this method can fail, eg if the thread has exited already,
so we need to take that into account and stop trying to get info
for that thread in later code. eg by setting 'thread_id' to 0
and then skipping any thread_id == 0 later.


> +            thd_stat[i].numa_node_id = numa_node_of_cpu(thd_stat[i].cpu_id);
> +        }
> +
> +        /* Retrieve all packages power plane energy counter */
> +        for (int i = 0; i <= maxpkgs; i++) {
> +            for (int j = 0; j < num_threads; j++) {
> +                /*
> +                 * Use the first thread we found that ran on the CPU
> +                 * of the package to read the packages energy counter
> +                 */
> +                if (thd_stat[j].numa_node_id == i) {

'i' is a CPU ID value, while 'numa_node_id' is a NUMA node ID value.
I don't think it is semantically valid to compare them for equality.

I'm not sure the NUMA node is even relevant, since IIUC from the docs
earlier, the power values are scoped per package, which would mean per
CPU socket.

> +                    pkg_stat[i].e_start =
> +                    vmsr_read_msr(MSR_PKG_ENERGY_STATUS, i, pid,
> +                                  s->msr_energy.socket_path);

> +                    break;
> +                }
> +            }
> +        }
> +
> +        /* Sleep a short period while the other threads are working */
> +        usleep(MSR_ENERGY_THREAD_SLEEP_US);
> +
> +        /*
> +         * Retrieve all packages power plane energy counter
> +         * Calculate the delta of all packages
> +         */
> +        for (int i = 0; i <= maxpkgs; i++) {
> +            for (int j = 0; j < num_threads; j++) {
> +                /*
> +                 * Use the first thread we found that ran on the CPU
> +                 * of the package to read the packages energy counter
> +                 */
> +                if (thd_stat[j].numa_node_id == i) {
> +                    pkg_stat[i].e_end =
> +                       vmsr_read_msr(MSR_PKG_ENERGY_STATUS,
> +                                     thd_stat[j].cpu_id,
> +                                     thd_stat[j].thread_id,
> +                                     s->msr_energy.socket_path);
> +                    /*
> +                     * Prevent the case we have migrate the VM
> +                     * during the sleep period or any other cases
> +                     * were energy counter might be lower after
> +                     * the sleep.
> +                     */
> +                    if (pkg_stat[i].e_end > pkg_stat[i].e_start) {
> +                        pkg_stat[i].e_delta =
> +                            pkg_stat[i].e_end - pkg_stat[i].e_start;
> +                    } else {
> +                        pkg_stat[i].e_delta = 0;
> +                    }
> +                    break;
> +                }
> +            }
> +        }
> +
> +        /* Delta of ticks spend by each thread between the sample */
> +        for (int i = 0; i < num_threads; i++) {
> +            if (vmsr_read_thread_stat(&thd_stat[i], pid, 1) != 0) {

Here we fetch the thread 'cpu_id' again. What is expected behaviour
if 'cpu_id' has changed since we read it earlier ? The delta ticks
calculation will be OK is cpu_is is still within the same package
as earlier, but if it moved to another package it is not valid.



> +                /*
> +                 * We don't count the dead thread
> +                 * i.e threads that existed before the sleep
> +                 * and not anymore
> +                 */
> +                thd_stat[i].delta_ticks = 0;
> +            } else {
> +                vmsr_delta_ticks(thd_stat, i);
> +            }
> +        }
> +
> +        /*
> +         * Identify the vCPU threads
> +         * Calculate the Number of vCPU per package
> +         */
> +        CPU_FOREACH(cpu) {
> +            for (int i = 0; i < num_threads; i++) {
> +                if (cpu->thread_id == thd_stat[i].thread_id) {
> +                    thd_stat[i].is_vcpu = true;
> +                    thd_stat[i].vcpu_id = cpu->cpu_index;
> +                    pkg_stat[thd_stat[i].numa_node_id].nb_vcpu++;

A numa node isn't a package AFAICT.

> +                    thd_stat[i].acpi_id = kvm_arch_vcpu_id(cpu);
> +                    break;
> +                }
> +            }
> +        }
> +
> +        /* Retrieve the virtual package number of each vCPU */
> +        for (int i = 0; i < vmsr->x86_cpu_list->len; i++) {
> +            for (int j = 0; j < num_threads; j++) {
> +                if ((thd_stat[j].acpi_id == vmsr->x86_cpu_list->cpus[i].arch_id)
> +                    && (thd_stat[j].is_vcpu == true)) {
> +                    x86_topo_ids_from_apicid(thd_stat[j].acpi_id,
> +                        &vmsr->topo_info, &topo_ids);
> +                    thd_stat[j].vpkg = topo_ids.pkg_id;
> +                }
> +            }
> +        }
> +
> +        /* Calculate the total energy of all non-vCPU thread */
> +        for (int i = 0; i < num_threads; i++) {
> +            double temp;
> +            if ((thd_stat[i].is_vcpu != true) &&
> +                (thd_stat[i].delta_ticks > 0)) {
> +                temp = vmsr_get_ratio(pkg_stat, thd_stat, maxticks, i);
> +                pkg_stat[thd_stat[i].numa_node_id].e_ratio
> +                    += (uint64_t)lround(temp);
> +            }
> +        }
> +
> +        /* Calculate the ratio per non-vCPU thread of each package */
> +        for (int i = 0; i <= maxpkgs; i++) {
> +            if (pkg_stat[i].nb_vcpu > 0) {
> +                pkg_stat[i].e_ratio = pkg_stat[i].e_ratio / pkg_stat[i].nb_vcpu;
> +            }
> +        }
> +
> +        /*
> +         * Calculate the energy for each Package:
> +         * Energy Package = sum of each vCPU energy that belongs to the package
> +         */
> +        for (int i = 0; i < num_threads; i++) {
> +            double temp;
> +
> +            if ((thd_stat[i].is_vcpu == true) && \
> +                    (thd_stat[i].delta_ticks > 0)) {
> +                temp = vmsr_get_ratio(pkg_stat, thd_stat, maxticks, i);
> +
> +                vpkgs_energy_stat[thd_stat[i].vpkg] += (uint64_t)lround(temp);
> +                vpkgs_energy_stat[thd_stat[i].vpkg] +=
> +                    pkg_stat[thd_stat[i].numa_node_id].e_ratio;
> +            }
> +        }
> +
> +        /*
> +         * Finally populate the vmsr register of each vCPU with the total
> +         * package value to emulate the real hardware where each CPU return the
> +         * value of the package it belongs.
> +         */
> +        for (int i = 0; i < num_threads; i++) {
> +            if ((thd_stat[i].is_vcpu == true) && \
> +                    (thd_stat[i].delta_ticks > 0)) {
> +                vmsr->msr_value[thd_stat[i].vcpu_id] = \
> +                                            vpkgs_energy_stat[thd_stat[i].vpkg];
> +          }
> +        }
> +
> +        /* Zero out the memory */
> +        for (int i = 0; i < num_threads; i++) {
> +            memset(thd_stat[i].utime, 0, 2 * sizeof(unsigned long long));
> +            memset(thd_stat[i].stime, 0, 2 * sizeof(unsigned long long));
> +        }
> +        memset(thd_stat, 0, num_threads * sizeof(thread_stat));
> +        memset(thread_ids, 0, sizeof(pid_t));
> +    }
> +
> +clean:
> +    rcu_unregister_thread();
> +    return NULL;
> +}

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu
  2024-01-30  9:39   ` Daniel P. Berrangé
@ 2024-03-12 11:21     ` Anthony Harivel
  2024-03-12 15:49       ` Daniel P. Berrangé
  0 siblings, 1 reply; 32+ messages in thread
From: Anthony Harivel @ 2024-03-12 11:21 UTC (permalink / raw)
  To: Daniel P. Berrangé; +Cc: pbonzini, mtosatti, qemu-devel, vchundur


Hi Daniel, Paolo,

Here my last questions before wrapping up and send v4, or maybe call off
my attempt to add RAPL interface in QEMU.


Daniel P. Berrangé, Jan 30, 2024 at 10:39:
> > +    rcu_register_thread();
> > +
> > +    /* Get QEMU PID*/
> > +    pid = getpid();
> > +
> > +    /* Nb of CPUS per packages */
> > +    maxcpus = vmsr_get_maxcpus(0);
> > +
> > +    /* Nb of Physical Packages on the system */
> > +    maxpkgs = vmsr_get_max_physical_package(maxcpus);
>
> This function can fail so this needs to be checked & reported.
>
> > +
> > +    /* Those MSR values should not change as well */
> > +    vmsr->msr_unit  = vmsr_read_msr(MSR_RAPL_POWER_UNIT, 0, pid,
> > +                                    s->msr_energy.socket_path);
> > +    vmsr->msr_limit = vmsr_read_msr(MSR_PKG_POWER_LIMIT, 0, pid,
> > +                                    s->msr_energy.socket_path);
> > +    vmsr->msr_info  = vmsr_read_msr(MSR_PKG_POWER_INFO, 0, pid,
> > +                                    s->msr_energy.socket_path);
>
> This function can fail for a variety of reasons, most especially if someone
> gave an incorrect socket path, or if the daemon is not running. This is not
> getting diagnosed, and even if we try to report it here, we're in a background
> thread at this point.
>
> I think we need to connect and report errors before even starting this
> thread, so that QEMU startup gets aborted upon configuration error.
>

Fair enough. Would it be ok to do the sanity check before 
rcu_register_thread() and "return NULL;" in case of error or would you 
prefer me to check all of this before even calling the 
qemu_thread_create() ? 

> > +        /* Populate all the thread stats */
> > +        for (int i = 0; i < num_threads; i++) {
> > +            thd_stat[i].utime = g_new0(unsigned long long, 2);
> > +            thd_stat[i].stime = g_new0(unsigned long long, 2);
> > +            thd_stat[i].thread_id = thread_ids[i];
> > +            vmsr_read_thread_stat(&thd_stat[i], pid, 0);
>
> It is non-obvious that the 3rd parameter here is an index into
> the utime & stime array. This function would be saner to review
> if called as:
>
>             vmsr_read_thread_stat(pid,
> 	                          thd_stat[i].thread_id,
> 	                          &thd_stat[i].utime[0],
> 	                          &thd_stat[i].stime[0],
> 	                          &thd_stat[i].cpu_id);
>
> so we see what are input parameters and what are output parameters.
>
> Also this method can fail, eg if the thread has exited already,
> so we need to take that into account and stop trying to get info
> for that thread in later code. eg by setting 'thread_id' to 0
> and then skipping any thread_id == 0 later.
>
>

Good point. I'll rework the function and return "thread_id" to 0 in 
case of failure in order to test it later on. 

> > +            thd_stat[i].numa_node_id = numa_node_of_cpu(thd_stat[i].cpu_id);
> > +        }
> > +
> > +        /* Retrieve all packages power plane energy counter */
> > +        for (int i = 0; i <= maxpkgs; i++) {
> > +            for (int j = 0; j < num_threads; j++) {
> > +                /*
> > +                 * Use the first thread we found that ran on the CPU
> > +                 * of the package to read the packages energy counter
> > +                 */
> > +                if (thd_stat[j].numa_node_id == i) {
>
> 'i' is a CPU ID value, while 'numa_node_id' is a NUMA node ID value.
> I don't think it is semantically valid to compare them for equality.
>
> I'm not sure the NUMA node is even relevant, since IIUC from the docs
> earlier, the power values are scoped per package, which would mean per
> CPU socket.
>

'i' here is the package number on the host. 
I'm using functions of libnuma to populate the maxpkgs of the host. 
I tested this on different Intel CPU with multiple packages and this 
has always returned the good number of packages. A false positive ?

So here I'm checking if the thread has run on the package number 'i'. 
I populate 'numa_node_id' with numa_node_of_cpu().

I did not wanted to reinvent the wheel and the only lib that was talking 
about "node" was libnuma.

Maybe I'm wrong assuming that a "node" (defined as an area where all 
memory has the same speed as seen from a particular CPU) could lead me 
to the packages number ?

And this is what I see you wrote below: 
"A numa node isn't a package AFAICT."


Regards,
Anthony



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu
  2024-03-12 11:21     ` Anthony Harivel
@ 2024-03-12 15:49       ` Daniel P. Berrangé
  2024-03-13 10:48         ` Anthony Harivel
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel P. Berrangé @ 2024-03-12 15:49 UTC (permalink / raw)
  To: Anthony Harivel; +Cc: pbonzini, mtosatti, qemu-devel, vchundur

On Tue, Mar 12, 2024 at 12:21:14PM +0100, Anthony Harivel wrote:

> Daniel P. Berrangé, Jan 30, 2024 at 10:39:
> > > +    rcu_register_thread();
> > > +
> > > +    /* Get QEMU PID*/
> > > +    pid = getpid();
> > > +
> > > +    /* Nb of CPUS per packages */
> > > +    maxcpus = vmsr_get_maxcpus(0);
> > > +
> > > +    /* Nb of Physical Packages on the system */
> > > +    maxpkgs = vmsr_get_max_physical_package(maxcpus);
> >
> > This function can fail so this needs to be checked & reported.
> >
> > > +
> > > +    /* Those MSR values should not change as well */
> > > +    vmsr->msr_unit  = vmsr_read_msr(MSR_RAPL_POWER_UNIT, 0, pid,
> > > +                                    s->msr_energy.socket_path);
> > > +    vmsr->msr_limit = vmsr_read_msr(MSR_PKG_POWER_LIMIT, 0, pid,
> > > +                                    s->msr_energy.socket_path);
> > > +    vmsr->msr_info  = vmsr_read_msr(MSR_PKG_POWER_INFO, 0, pid,
> > > +                                    s->msr_energy.socket_path);
> >
> > This function can fail for a variety of reasons, most especially if someone
> > gave an incorrect socket path, or if the daemon is not running. This is not
> > getting diagnosed, and even if we try to report it here, we're in a background
> > thread at this point.
> >
> > I think we need to connect and report errors before even starting this
> > thread, so that QEMU startup gets aborted upon configuration error.
> >
> 
> Fair enough. Would it be ok to do the sanity check before 
> rcu_register_thread() and "return NULL;" in case of error or would you 
> prefer me to check all of this before even calling the 
> qemu_thread_create() ?

I think it is best to initialize in kvm_msr_energy_thread_init(),
prior to thread creation, as that avoids some complexity cleaning
up and reporting errors.

> > > +            thd_stat[i].numa_node_id = numa_node_of_cpu(thd_stat[i].cpu_id);
> > > +        }
> > > +
> > > +        /* Retrieve all packages power plane energy counter */
> > > +        for (int i = 0; i <= maxpkgs; i++) {
> > > +            for (int j = 0; j < num_threads; j++) {
> > > +                /*
> > > +                 * Use the first thread we found that ran on the CPU
> > > +                 * of the package to read the packages energy counter
> > > +                 */
> > > +                if (thd_stat[j].numa_node_id == i) {
> >
> > 'i' is a CPU ID value, while 'numa_node_id' is a NUMA node ID value.
> > I don't think it is semantically valid to compare them for equality.
> >
> > I'm not sure the NUMA node is even relevant, since IIUC from the docs
> > earlier, the power values are scoped per package, which would mean per
> > CPU socket.
> >
> 
> 'i' here is the package number on the host.

Opps, yes, 'maxpkgs' is iterating over CPU /socket/ ID numbers

The point still stands though. NUMA node ID numbers are not
guaranteed to be the same as socket ID numbers. Very often
then will be the same (which makes it annoying to test as it
is easy to not realize the difference), but we can't rely on
that.

> I'm using functions of libnuma to populate the maxpkgs of the host. 
> I tested this on different Intel CPU with multiple packages and this 
> has always returned the good number of packages. A false positive ?

maxpkgs comes from vmsr_get_max_physical_package() which you're
reading from sysfs, rather than libnuma.

> So here I'm checking if the thread has run on the package number 'i'. 
> I populate 'numa_node_id' with numa_node_of_cpu().
> 
> I did not wanted to reinvent the wheel and the only lib that was talking 
> about "node" was libnuma.

I'm not actually convinced we need to use libnuma at all. IIUC, you're
just trying to track all CPUs within the same physical socket (package).
I don't think we need to care about NUMA nodes to do that tracking.

> Maybe I'm wrong assuming that a "node" (defined as an area where all 
> memory has the same speed as seen from a particular CPU) could lead me 
> to the packages number ?

Historically you could have multiple sockets in the same NUMA node
ie a m:1 mapping.

These days with AMD sockets, you can have 1 socket compromising
many NUMA nodes, as individual dies within a socket are each their
own NUMA node. So a 1:m mapping

On Intel I think it is still typical to have 1 socket per numa
node, but again I don't think we can rely on that 1:1 mapping.

Fortunately I don't think it matters, since it looks like you
don't really need to track NUMA nodes, only sockets (phnysical
package IDs)

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu
  2024-03-12 15:49       ` Daniel P. Berrangé
@ 2024-03-13 10:48         ` Anthony Harivel
  2024-03-13 11:04           ` Daniel P. Berrangé
  0 siblings, 1 reply; 32+ messages in thread
From: Anthony Harivel @ 2024-03-13 10:48 UTC (permalink / raw)
  To: Daniel P. Berrangé; +Cc: pbonzini, mtosatti, qemu-devel, vchundur

Hi Daniel,

Daniel P. Berrangé, Mar 12, 2024 at 16:49:

> The point still stands though. NUMA node ID numbers are not
> guaranteed to be the same as socket ID numbers. Very often
> then will be the same (which makes it annoying to test as it
> is easy to not realize the difference), but we can't rely on
> that.
>
> > I'm using functions of libnuma to populate the maxpkgs of the host. 
> > I tested this on different Intel CPU with multiple packages and this 
> > has always returned the good number of packages. A false positive ?
>
> maxpkgs comes from vmsr_get_max_physical_package() which you're
> reading from sysfs, rather than libnuma.
>
> > So here I'm checking if the thread has run on the package number 'i'. 
> > I populate 'numa_node_id' with numa_node_of_cpu().
> > 
> > I did not wanted to reinvent the wheel and the only lib that was talking 
> > about "node" was libnuma.
>
> I'm not actually convinced we need to use libnuma at all. IIUC, you're
> just trying to track all CPUs within the same physical socket (package).
> I don't think we need to care about NUMA nodes to do that tracking.
>

Alright, having a deeper look I'm actually using NUMA for 2 info:

- How many cpu per Package: this helps me calculate the ratio.

- To whom package the cpu belongs: to calculate the ratio with the right 
  package energy counter.

Without libnuma, I'm bit confused on how to handle this. 

Should I parse /sys/bus/node/devices/node* to know how many packages ?
Should I parse /sys/bus/node/devices/node0/cpu0/topology/core_cpus_list 
to handle which cpu belongs to which package ?

Would that be too cumbusome for the user to enter the detail about how
many packages and how many cpu per pakages ? 

i.e: 
-kvm,rapl=true,maxpkgs=2,cpupkgs=8,rapl-helper-socket=/path/sock.sock

> > Maybe I'm wrong assuming that a "node" (defined as an area where all 
> > memory has the same speed as seen from a particular CPU) could lead me 
> > to the packages number ?
>
> Historically you could have multiple sockets in the same NUMA node
> ie a m:1 mapping.
>
> These days with AMD sockets, you can have 1 socket compromising
> many NUMA nodes, as individual dies within a socket are each their
> own NUMA node. So a 1:m mapping
>
> On Intel I think it is still typical to have 1 socket per numa
> node, but again I don't think we can rely on that 1:1 mapping.
>
> Fortunately I don't think it matters, since it looks like you
> don't really need to track NUMA nodes, only sockets (phnysical
> package IDs)
>

Very informative, thanks !

> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Regards,
Anthony




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu
  2024-03-13 10:48         ` Anthony Harivel
@ 2024-03-13 11:04           ` Daniel P. Berrangé
  2024-03-14  8:26             ` Anthony Harivel
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel P. Berrangé @ 2024-03-13 11:04 UTC (permalink / raw)
  To: Anthony Harivel; +Cc: pbonzini, mtosatti, qemu-devel, vchundur

On Wed, Mar 13, 2024 at 11:48:19AM +0100, Anthony Harivel wrote:
> Hi Daniel,
> 
> Daniel P. Berrangé, Mar 12, 2024 at 16:49:
> 
> > The point still stands though. NUMA node ID numbers are not
> > guaranteed to be the same as socket ID numbers. Very often
> > then will be the same (which makes it annoying to test as it
> > is easy to not realize the difference), but we can't rely on
> > that.
> >
> > > I'm using functions of libnuma to populate the maxpkgs of the host. 
> > > I tested this on different Intel CPU with multiple packages and this 
> > > has always returned the good number of packages. A false positive ?
> >
> > maxpkgs comes from vmsr_get_max_physical_package() which you're
> > reading from sysfs, rather than libnuma.
> >
> > > So here I'm checking if the thread has run on the package number 'i'. 
> > > I populate 'numa_node_id' with numa_node_of_cpu().
> > > 
> > > I did not wanted to reinvent the wheel and the only lib that was talking 
> > > about "node" was libnuma.
> >
> > I'm not actually convinced we need to use libnuma at all. IIUC, you're
> > just trying to track all CPUs within the same physical socket (package).
> > I don't think we need to care about NUMA nodes to do that tracking.
> >
> 
> Alright, having a deeper look I'm actually using NUMA for 2 info:
> 
> - How many cpu per Package: this helps me calculate the ratio.
> 
> - To whom package the cpu belongs: to calculate the ratio with the right 
>   package energy counter.
> 
> Without libnuma, I'm bit confused on how to handle this. 
> 
> Should I parse /sys/bus/node/devices/node* to know how many packages ?
> Should I parse /sys/bus/node/devices/node0/cpu0/topology/core_cpus_list 
> to handle which cpu belongs to which package ?

You don't need to access it via the /node/ hierarchy

The canonical path for CPUs would be

  /sys/devices/system/cpu/cpuNNN/topology

The core_cpus_list file is giving you hyper-thread siblings within
a core, which I don't think is what you want.

If you're after discrete physical packages, then 'package_cpus_list'
gives you all CPUs within a physical socket (package) I believe.

> Would that be too cumbusome for the user to enter the detail about how
> many packages and how many cpu per pakages ? 
> 
> i.e: 
> -kvm,rapl=true,maxpkgs=2,cpupkgs=8,rapl-helper-socket=/path/sock.sock

That won't cope with asymmetrical CPU configurations, so I think it
is preferrable to read the info from sysfs.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu
  2024-03-13 11:04           ` Daniel P. Berrangé
@ 2024-03-14  8:26             ` Anthony Harivel
  2024-03-14  8:55               ` Daniel P. Berrangé
  0 siblings, 1 reply; 32+ messages in thread
From: Anthony Harivel @ 2024-03-14  8:26 UTC (permalink / raw)
  To: Daniel P. Berrangé; +Cc: pbonzini, mtosatti, qemu-devel, vchundur

Hi Daniel,

> You don't need to access it via the /node/ hierarchy
>
> The canonical path for CPUs would be
>
>   /sys/devices/system/cpu/cpuNNN/topology
>
> The core_cpus_list file is giving you hyper-thread siblings within
> a core, which I don't think is what you want.
>
> If you're after discrete physical packages, then 'package_cpus_list'
> gives you all CPUs within a physical socket (package) I believe.
>

Yes, this could work.
However, on laptop, I've got: 
cat package_cpus_list
0-11

Where on server: 
package_cpus_list
0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46

I asked my teammate: always the same results. This is I guess due to 
either a difference in the kernel version or how the kernel is handling 
the case where there is only one package, versus the case with multiple 
packages. 

Anyway, writing a C function to handle both cases might not be easy. 

The other solution would be to loop through 
/sys/devices/system/cpu/cpuNNN/ and update a table of integer of a fixed 
size (*) where I increment table[0] for package_0 when I encounter a CPU 
that belongs to package_0 and so on. Reading 
cpuNNN/topology/physical_package_id is giving to whom package the cpu 
belongs to.

This is a bit tedious but a safer solution, I think.

(*): Maybe dynamic allocation is better ?

Thanks again for your guidance.

Regards,
Anthony

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu
  2024-03-14  8:26             ` Anthony Harivel
@ 2024-03-14  8:55               ` Daniel P. Berrangé
  0 siblings, 0 replies; 32+ messages in thread
From: Daniel P. Berrangé @ 2024-03-14  8:55 UTC (permalink / raw)
  To: Anthony Harivel; +Cc: pbonzini, mtosatti, qemu-devel, vchundur

On Thu, Mar 14, 2024 at 09:26:53AM +0100, Anthony Harivel wrote:
> 
> Hi Daniel,
> 
> 
> > You don't need to access it via the /node/ hierarchy
> >
> > The canonical path for CPUs would be
> >
> >   /sys/devices/system/cpu/cpuNNN/topology
> >
> > The core_cpus_list file is giving you hyper-thread siblings within
> > a core, which I don't think is what you want.
> >
> > If you're after discrete physical packages, then 'package_cpus_list'
> > gives you all CPUs within a physical socket (package) I believe.
> >
> 
> Yes, this could work.
> However, on laptop, I've got: 
> cat package_cpus_list
> 0-11
> 
> Where on server: 
> package_cpus_list
> 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
> 
> I asked my teammate: always the same results. This is I guess due to 
> either a difference in the kernel version or how the kernel is handling 
> the case where there is only one package, versus the case with multiple 
> packages.

Both are the same data format - it is a list of ranges. In the
laptop example, there's only a single range to parse. In the
server example there are many ranges but since each range is
only 1 cpu, it has collapsed the ranges to the single CPU id.

> Anyway, writing a C function to handle both cases might not be easy.

Approximately

  * Read the whole file with g_get_file_contents
  * Use  g_strsplit(data, ",", 0) to get a list of ranges
  * Iterate over the return list of ranges and g_strsplit(range, "-", 2);
      - The returned list should be either a single element (if there was
         no '-'), or a pair of elements (if there was a N-M)


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2024-03-14  8:56 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-01-25  7:22 [PATCH v3 0/3] Add support for the RAPL MSRs series Anthony Harivel
2024-01-25  7:22 ` [PATCH v3 1/3] qio: add support for SO_PEERCRED for socket channel Anthony Harivel
2024-01-25 16:37   ` Daniel P. Berrangé
2024-01-29 19:25     ` Paolo Bonzini
2024-01-29 19:30       ` Daniel P. Berrangé
2024-01-25  7:22 ` [PATCH v3 2/3] tools: build qemu-vmsr-helper Anthony Harivel
2024-01-29 18:53   ` Daniel P. Berrangé
2024-01-29 19:33     ` Paolo Bonzini
2024-01-29 19:45       ` Daniel P. Berrangé
2024-01-29 19:53         ` Daniel P. Berrangé
2024-01-29 20:21           ` Paolo Bonzini
2024-02-21 13:19         ` Anthony Harivel
2024-02-21 13:47           ` Daniel P. Berrangé
2024-02-21 13:52             ` Anthony Harivel
2024-03-01 11:08       ` Anthony Harivel
2024-01-25  7:22 ` [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu Anthony Harivel
2024-01-29 19:29   ` Daniel P. Berrangé
2024-02-20 14:00     ` Anthony Harivel
2024-02-20 15:00       ` Daniel P. Berrangé
2024-03-05 14:58     ` Anthony Harivel
2024-01-30  9:13   ` Daniel P. Berrangé
2024-03-04 14:41     ` Anthony Harivel
2024-03-04 14:48       ` Daniel P. Berrangé
2024-03-05 13:25         ` Anthony Harivel
2024-03-05 13:57           ` Daniel P. Berrangé
2024-01-30  9:39   ` Daniel P. Berrangé
2024-03-12 11:21     ` Anthony Harivel
2024-03-12 15:49       ` Daniel P. Berrangé
2024-03-13 10:48         ` Anthony Harivel
2024-03-13 11:04           ` Daniel P. Berrangé
2024-03-14  8:26             ` Anthony Harivel
2024-03-14  8:55               ` Daniel P. Berrangé

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).