From: Paolo Bonzini <pbonzini@redhat.com>
To: "Enrico Weigelt, metux IT consult" <lkml@metux.net>,
Jing Zhang <jingzhangos@google.com>, KVM <kvm@vger.kernel.org>,
KVMARM <kvmarm@lists.cs.columbia.edu>,
LinuxMIPS <linux-mips@vger.kernel.org>,
KVMPPC <kvm-ppc@vger.kernel.org>,
LinuxS390 <linux-s390@vger.kernel.org>,
Linuxkselftest <linux-kselftest@vger.kernel.org>,
Marc Zyngier <maz@kernel.org>, James Morse <james.morse@arm.com>,
Julien Thierry <julien.thierry.kdev@gmail.com>,
Suzuki K Poulose <suzuki.poulose@arm.com>,
Will Deacon <will@kernel.org>,
Huacai Chen <chenhuacai@kernel.org>,
Aleksandar Markovic <aleksandar.qemu.devel@gmail.com>,
Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
Paul Mackerras <paulus@ozlabs.org>,
Christian Borntraeger <borntraeger@de.ibm.com>,
Janosch Frank <frankja@linux.ibm.com>,
David Hildenbrand <david@redhat.com>,
Cornelia Huck <cohuck@redhat.com>,
Claudio Imbrenda <imbrenda@linux.ibm.com>,
Sean Christopherson <seanjc@google.com>,
Vitaly Kuznetsov <vkuznets@redhat.com>,
Jim Mattson <jmattson@google.com>,
Peter Shier <pshier@google.com>, Oliver Upton <oupton@google.com>,
David Rientjes <rientjes@google.com>,
Emanuele Giuseppe Esposito <eesposit@redhat.com>,
David Matlack <dmatlack@google.com>,
Ricardo Koller <ricarkol@google.com>,
Krish Sadhukhan <krish.sadhukhan@oracle.com>,
Fuad Tabba <tabba@google.com>
Subject: Re: [PATCH v9 0/5] KVM statistics data fd-based binary interface
Date: Tue, 15 Jun 2021 13:31:10 +0200
Message-ID: <ad7905f9-8338-382a-b1df-9b3352bbd2f8@redhat.com>
In-Reply-To: <b86aa6df-5fd7-d705-1688-4d325df6f7d9@metux.net>
On 15/06/21 10:37, Enrico Weigelt, metux IT consult wrote:
> * why is it binary instead of text ? is it so very high volume that
> it really matters ?
The main reason to have a binary format is not actually the high volume
(though that also plays a part). Rather, we would really like to include
the schema to make the statistics self-describing. This includes stuff
like whether the unit of measure of a statistic is clock cycles,
nanoseconds, pages or whatnot; having this kind of information in text
leads to awkwardness in the parsers. trace-cmd is another example where
the data consists of a schema followed by binary data.
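To illustrate what "a schema followed by binary data" can look like, here is a minimal sketch. The layout below (descriptor count, fixed-size descriptors carrying a name and a unit code, then one 64-bit value per descriptor) is a toy invented for this example and is not the layout used by the KVM patches; the statistic names and unit codes are likewise hypothetical.

```python
import struct

# Hypothetical toy layout (NOT the KVM ABI):
#   header:     u32 descriptor count
#   descriptor: 16-byte NUL-padded name, u8 unit code (one per statistic)
#   data:       one little-endian u64 per descriptor, in schema order
UNITS = {0: "none", 1: "cycles", 2: "nanoseconds", 3: "pages"}
HEADER = struct.Struct("<I")
DESC = struct.Struct("<16sB")
VALUE = struct.Struct("<Q")


def encode(stats):
    """Pack [(name, unit_code, value), ...] as schema followed by data."""
    out = bytearray(HEADER.pack(len(stats)))
    for name, unit, _ in stats:
        out += DESC.pack(name.encode("ascii"), unit)
    for _, _, value in stats:
        out += VALUE.pack(value)
    return bytes(out)


def decode(blob):
    """Return {name: (value, unit_name)} using only the embedded schema."""
    (count,) = HEADER.unpack_from(blob, 0)
    data_start = HEADER.size + count * DESC.size
    result = {}
    for i in range(count):
        raw_name, unit = DESC.unpack_from(blob, HEADER.size + i * DESC.size)
        (value,) = VALUE.unpack_from(blob, data_start + i * VALUE.size)
        name = raw_name.rstrip(b"\0").decode("ascii")
        result[name] = (value, UNITS.get(unit, "unknown"))
    return result


# Hypothetical statistics; the reader needs no out-of-band knowledge of them.
blob = encode([("exits", 0, 7), ("poll_time", 2, 1500)])
print(decode(blob))
```

Because the reader looks statistics up by name and takes the unit from the descriptor, a producer can insert a new statistic anywhere in the list without breaking existing consumers, and no unit has to be guessed from a text suffix.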
A text format could certainly be added if there's a use case, but for
developer use, debugfs is usually a suitable replacement.
Last year we tried the opposite direction: we built a one-value-per-file
filesystem with a common API that any subsystem could use (e.g.
providing ethtool stats, /proc/interrupts, etc. in addition to KVM
stats). We started with text, similar to sysfs, with the plan of
extending it to a binary format later. However, other subsystems
expressed very little interest in this, so instead we decided to go with
something that is designed around KVM needs.
Still, the binary format that KVM uses is designed not to be
KVM-specific. If other subsystems want to publish high-volume,
self-describing statistic information, they are welcome to share the
binary format and the code. Perhaps it may make sense in some cases to
have them in sysfs, even (e.g. /sys/kernel/slab/*/.stats). As Greg said,
sysfs is currently one value per file, but perhaps that could be changed
if the binary format is an additional way to access the information and
not the only one (not that I'm planning to do it).
> * how will possible future extensions of the telemetry packets work ?
The format includes a schema, so it's possible to add more statistics in
the future. The exact list of statistics varies per architecture and is
not part of the userspace API (obvious caveat: https://xkcd.com/1172/).
> * aren't there other means to get this fd instead of an ioctl() on the
> VM fd ? something more from the outside (eg. sysfs/procfs)
Not yet, but if there's a need it can be added. It'd be plausible to
publish system-wide statistics via an ioctl on /dev/kvm, for example.
We'd have to check how this compares with stuff that is world-readable
in procfs and sysfs, but I don't think there are security concerns in
exposing that.
There's also pidfd_getfd(2) which can be used to pull a VM file
descriptor from another running process. That can be used to avoid the
issue of KVM file descriptors being unnamed.
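As a rough sketch of the pidfd_getfd(2) route: Python's standard library has os.pidfd_open() but no wrapper for pidfd_getfd, so the example below issues the raw syscall through ctypes. The syscall number 438 is an assumption that holds for x86-64 and arm64 on Linux 5.6 or later; the helper names pidfd_getfd() and pull_fd() are made up for this example.

```python
import ctypes
import os

# Assumption: 438 is pidfd_getfd on x86-64 and arm64 (Linux 5.6+);
# check the syscall table for other architectures.
SYS_PIDFD_GETFD = 438

_libc = ctypes.CDLL(None, use_errno=True)
_libc.syscall.restype = ctypes.c_long


def pidfd_getfd(pidfd, target_fd):
    """Duplicate target_fd from the process behind pidfd into this process."""
    ret = _libc.syscall(
        ctypes.c_long(SYS_PIDFD_GETFD),
        ctypes.c_int(pidfd),
        ctypes.c_int(target_fd),
        ctypes.c_uint(0),  # flags: must currently be 0
    )
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret


def pull_fd(pid, target_fd):
    """Open a pidfd for pid, pull target_fd out of it, return the local copy."""
    pidfd = os.pidfd_open(pid)  # Python 3.9+, Linux 5.3+
    try:
        return pidfd_getfd(pidfd, target_fd)
    finally:
        os.close(pidfd)
```

The caller needs ptrace-attach permission over the target process, and it has to know which descriptor number in the target refers to the VM, for example by inspecting /proc/<pid>/fd. The returned descriptor can then be used like a locally opened one.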
> * how will that relate to other hypervisors ?
Other hypervisors do not run as part of the Linux kernel (at least they
are not upstream). These statistics only apply to Linux *hosts*, not
guests.
As far as I know, there is no standard that Xen or the proprietary
hypervisors use to communicate their telemetry info to monitoring tools,
and also no standard binary format used by exporters to talk to
monitoring tools. If this format is adopted by other hypervisors
or any random software, I will be happy.
> Some notes from the operating perspective:
>
> In typical datacenters we've got various monitoring tools that are able
> to catch up lots of data from different sources (especially files). If
> an operator e.g. is interested in something in happening in some file
> (e.g. in /proc of /sys), it's quite trivial - just configure yet another
> probe (maybe some regex for parsing) and done. Automatically fed in his
> $monitoring_solution (e.g. nagios, ELK, Splunk, whatsnot)
... but in practice what you do is you have prebuilt exporters that
talk to $monitoring_solution. Monitoring individual files is the
exception, not the rule. But indeed Libvirt already has I/O and network
statistics and there is an exporter for Prometheus, so we should add
support for this new method as well to both QEMU (exporting the file
descriptor) and Libvirt.
I hope this helps clarify your doubts!
Paolo
> With your approach, it's not that simple: now the operator needs to
> create (and deploy and manage) a separate agent that somehow receives
> that fd from the VMM, reads and parses that specific binary stream
> and finally pushes it into the monitoring infrastructure. Or the VMM
> writes it into some file, where some monitoring agent can pick it up.
> In any case, not actually trivial from ops perspective.