From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: KVM list <kvm@vger.kernel.org>,
    Steven Rostedt <rostedt@goodmis.org>,
    Christian Borntraeger <borntraeger@de.ibm.com>,
    Alex Williamson <alex.williamson@redhat.com>,
    Peter Feiner <pfeiner@google.com>,
    "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
    Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: "statsfs" API design
Date: Sat, 9 Nov 2019 16:49:52 +0100
Message-ID: <20191109154952.GA1365674@kroah.com>
In-Reply-To: <5d6cdcb1-d8ad-7ae6-7351-3544e2fa366d@redhat.com>

On Wed, Nov 06, 2019 at 04:56:25PM +0100, Paolo Bonzini wrote:
> Hi all,
>
> statsfs is a proposal for a new Linux kernel synthetic filesystem, to be
> mounted in /sys/kernel/stats, which exposes subsystem-level statistics
> in sysfs. Reading need not be particularly lightweight, but writing
> must be fast. Therefore, statistics are gathered at a fine-grain level
> in order to avoid locking or atomic operations, and then aggregated by
> statsfs until the desired granularity.

Wait, reading a statistic from userspace can be slow, but writing to it
from userspace has to be fast?  Or do you mean the speed is all for
reading/writing the value within the kernel?

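One possible reading of the quoted paragraph, sketched here as a userspace mock: every writer bumps a private counter with a plain increment (no lock, no atomic), and the cost of combining the counters is paid only when somebody reads. The names `demo_writer_stats`, `demo_count_exit()` and `demo_read_total_exits()` are hypothetical and are not part of the proposal. The mock is single-threaded, so it ignores the memory-ordering care a real concurrent reader would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NR_WRITERS 4	/* hypothetical fixed number of writers */

/* One private counter block per writer, so increments never contend. */
struct demo_writer_stats {
	uint64_t exits;
};

struct demo_writer_stats demo_writers[DEMO_NR_WRITERS];

/* Fast path: a plain increment of the writer's own counter. */
void demo_count_exit(size_t writer)
{
	assert(writer < DEMO_NR_WRITERS);
	demo_writers[writer].exits++;
}

/* Slow path: walk every writer's block and sum on demand. */
uint64_t demo_read_total_exits(void)
{
	uint64_t total = 0;

	for (size_t i = 0; i < DEMO_NR_WRITERS; i++)
		total += demo_writers[i].exits;
	return total;
}
```

Under this reading, "writing must be fast" refers to the in-kernel update path, and "reading need not be lightweight" refers to the aggregation done when a file is read.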
> The first user of statsfs would be KVM, which is currently exposing its
> stats in debugfs. However, debugfs access is now limited by the
> security lock down patches, and in addition statsfs aims to be a
> more-or-less stable API, hence the idea of making it a separate
> filesystem and mount point.

Nice, I've had people ask about something like this for a while now.
For the most part they just dump stuff in sysfs instead (see the DRM
patches recently for people attempting to do that for debugfs values as
well.)

> A few people have already expressed interest in this. Christian
> Borntraeger presented on the kvm_stat tool recently at KVM Forum and was
> also thinking about using some high-level API in debugfs. Google has
> KVM patches to gather statistics in a binary format; it may be useful to
> add this kind of functionality (and some kind of introspection similar
> to what tracing does) to statsfs too in the future, but this is
> independent from the kernel API. I'm also CCing Alex Williamson, in
> case VFIO is interested in something similar, and Steven Rostedt because
> apparently he has enough free time to write poetry in addition to code.
>
> There are just two concepts in statsfs, namely "values" (aka files) and
> "sources" (directories).
>
> A value represents a single quantity that is gathered by the statsfs
> client. It could be the number of vmexits of a given kind, the amount
> of memory used by some data structure, the length of the longest hash
> table chain, or anything like that.
>
> Values are described by a struct like this one:
>
>     struct statsfs_value {
>         const char *name;
>         enum stat_type type;    /* STAT_TYPE_{BOOL,U64,...} */
>         u16 aggr_kind;          /* Bitmask with zero or more of
>                                  * STAT_AGGR_{MIN,MAX,SUM,...}
>                                  */
>         u16 mode;               /* File mode */
>         int offset;             /* Offset from base address
>                                  * to field containing the value
>                                  */
>     };
>
> As you can see, values are basically integers stored somewhere in a
> struct. The statsfs_value struct also includes information on which
> operations (for example sum, min, max, average, count nonzero) it makes
> sense to expose when the values are aggregated.

What can userspace do with that info?

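To make the "integers stored somewhere in a struct" description concrete, here is a minimal userspace sketch of how a descriptor with an `offset` could be resolved against a base address. The struct layout mirrors the quoted one, with `uint16_t` standing in for `u16`. The enum members, `demo_stats`, `demo_values` and `demo_read_value()` are assumptions made for illustration, not the proposed kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the quoted descriptor. */
enum stat_type { STAT_TYPE_BOOL, STAT_TYPE_U64 };

struct statsfs_value {
	const char *name;
	enum stat_type type;
	uint16_t aggr_kind;
	uint16_t mode;
	int offset;
};

/* Hypothetical statistics block that the descriptors point into. */
struct demo_stats {
	uint64_t exits;
	bool in_guest;
};

const struct statsfs_value demo_values[] = {
	{ "exits",    STAT_TYPE_U64,  0, 0444, offsetof(struct demo_stats, exits) },
	{ "in_guest", STAT_TYPE_BOOL, 0, 0444, offsetof(struct demo_stats, in_guest) },
};

/* Resolve base + offset and widen the field to a u64 for display. */
uint64_t demo_read_value(const struct statsfs_value *val, const void *base)
{
	const char *addr;

	assert(val && base);
	addr = (const char *)base + val->offset;
	if (val->type == STAT_TYPE_BOOL)
		return *(const bool *)addr ? 1 : 0;
	return *(const uint64_t *)addr;
}
```

The same descriptor array can then be reused for every object that shares the struct layout, which is what the `ptr` argument of `statsfs_source_add_values()` in the quoted API appears to provide.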
> Sources form the bulk of the statsfs API. They can include two kinds of
> elements:
>
> - values as described above. The common case is to have many values
> with the same base address, which are represented by an array of struct
> statsfs_value
>
> - subordinate sources
>
> Adding a subordinate source has two effects:
>
> - it creates a subdirectory for each subordinate source
>
> - for each value in the subordinate sources which has aggr_kind != 0,
> corresponding values will be created in the parent directory too. If
> multiple subordinate sources are backed by the same array of struct
> statsfs_value, values from all those sources will be aggregated. That
> is, statsfs will compute these from the values of all items in the list
> and show them in the parent directory.
>
> Writable values can only be written with a value of zero. Writing zero
> to an aggregate zeroes all the corresponding values in the subordinate
> sources.
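The aggregation and write-zero rules quoted above can be sketched in userspace as follows. This is only an illustration under stated assumptions: the `STAT_AGGR_*` bit values are invented (the proposal names the flags but not their values), and `demo_vm_stats`, `demo_aggregate()` and `demo_clear()` are hypothetical helpers. One aggregation kind is handled per call and every value is treated as a u64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit assignments for the flags named in the proposal. */
#define STAT_AGGR_SUM (1u << 0)
#define STAT_AGGR_MIN (1u << 1)
#define STAT_AGGR_MAX (1u << 2)

/* Hypothetical per-source statistics block. */
struct demo_vm_stats {
	uint64_t exits;
};

/* Combine the u64 found at `offset` in each subordinate's base address. */
uint64_t demo_aggregate(void *const *bases, size_t n, size_t offset,
			unsigned int kind)
{
	uint64_t acc = 0;

	assert(bases || n == 0);
	for (size_t i = 0; i < n; i++) {
		uint64_t v = *(const uint64_t *)((const char *)bases[i] + offset);

		if (kind == STAT_AGGR_SUM)
			acc += v;
		else if (kind == STAT_AGGR_MIN)
			acc = (i == 0 || v < acc) ? v : acc;
		else if (kind == STAT_AGGR_MAX)
			acc = (v > acc) ? v : acc;
	}
	return acc;
}

/* Writing zero to the aggregate zeroes the value in every subordinate. */
void demo_clear(void *const *bases, size_t n, size_t offset)
{
	for (size_t i = 0; i < n; i++)
		*(uint64_t *)((char *)bases[i] + offset) = 0;
}
```

With two subordinates holding 5 and 9, the parent directory would show 14 for the sum, 5 for the minimum and 9 for the maximum, and a write of zero to any of them would reset both underlying counters.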
>
> Sources are manipulated with these four functions:
>
>     struct statsfs_source *statsfs_source_create(const char *fmt,
>                                                  ...);
>     void statsfs_source_add_values(struct statsfs_source *source,
>                                    struct statsfs_value *stat,
>                                    int n, void *ptr);
>     void statsfs_source_add_subordinate(
>                                    struct statsfs_source *source,
>                                    struct statsfs_source *sub);
>     void statsfs_source_remove_subordinate(
>                                    struct statsfs_source *source,
>                                    struct statsfs_source *sub);
>
> Sources are reference counted, and for this reason there is also a pair
> of functions in the usual style:
>
>     void statsfs_source_get(struct statsfs_source *);
>     void statsfs_source_put(struct statsfs_source *);
>
> Finally,
>
>     void statsfs_source_register(struct statsfs_source *source);
>
> lets you create a toplevel statsfs directory.
>
> As a practical example, KVM's usage of debugfs could be replaced by
> something like this:
>
>     /* Globals */
>     struct statsfs_value vcpu_stats[] = ...;
>     struct statsfs_value vm_stats[] = ...;
>     static struct statsfs_source *kvm_source;
>
>     /* On module creation */
>     kvm_source = statsfs_source_create("kvm");
>     statsfs_source_register(kvm_source);
>
>     /* On VM creation */
>     kvm->src = statsfs_source_create("%d-%d",
>                                      task_pid_nr(current), fd);
>     statsfs_source_add_values(kvm->src, vm_stats,
>                               ARRAY_SIZE(vm_stats),
>                               &kvm->stats);
>     statsfs_source_add_subordinate(kvm_source, kvm->src);
>
>     /* On vCPU creation */
>     vcpu_src = statsfs_source_create("vcpu%d", vcpu->vcpu_id);
>     statsfs_source_add_values(vcpu_src, vcpu_stats,
>                               ARRAY_SIZE(vcpu_stats),
>                               &vcpu->stats);
>     statsfs_source_add_subordinate(kvm->src, vcpu_src);
>     /*
>      * No need to keep the vcpu_src around since there's no
>      * separate vCPU deletion event; rely on refcount
>      * exclusively.
>      */
>     statsfs_source_put(vcpu_src);
>
>     /* On VM deletion */
>     statsfs_source_remove_subordinate(kvm_source, kvm->src);
>     statsfs_source_put(kvm->src);
>
>     /* On KVM exit */
>     statsfs_source_put(kvm_source);
>
> How does this look?
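The reference-count flow in the quoted KVM example can be traced with a small userspace mock. It assumes, as the example implies but does not state, that creation returns one reference and that adding a subordinate takes another. All `demo_*` names are hypothetical, each source holds at most one subordinate to keep the sketch short, and the counter is a plain `int` rather than anything atomic.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_source {
	int refcount;
	struct demo_source *sub;	/* at most one subordinate in this mock */
};

int demo_freed;			/* number of sources released so far */

/* Creation hands the caller one reference (assumption). */
struct demo_source *demo_source_create(void)
{
	struct demo_source *s = calloc(1, sizeof(*s));

	if (s)
		s->refcount = 1;
	return s;
}

void demo_source_get(struct demo_source *s)
{
	s->refcount++;
}

/* Drop one reference; the last put also drops the subordinate's. */
void demo_source_put(struct demo_source *s)
{
	assert(s->refcount > 0);
	if (--s->refcount > 0)
		return;
	if (s->sub)
		demo_source_put(s->sub);
	free(s);
	demo_freed++;
}

/* The parent takes its own reference on the subordinate (assumption). */
void demo_source_add_subordinate(struct demo_source *parent,
				 struct demo_source *sub)
{
	demo_source_get(sub);
	parent->sub = sub;
}

void demo_source_remove_subordinate(struct demo_source *parent,
				    struct demo_source *sub)
{
	if (parent->sub != sub)
		return;
	parent->sub = NULL;
	demo_source_put(sub);
}
```

Replaying the quoted sequence against this mock, the vCPU source survives its creator's early put because the VM source still holds a reference, and it is released only when the VM source itself goes away.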

Where do the actual values get changed so that the change is reflected
in the filesystem?

I have some old notes somewhere about what people really want when it
comes to a good "statistics" datatype, that I was thinking of building
off of, but that seems independent of what you are doing here, right?
This is just exporting existing values to userspace in a semi-sane way?

Anyway, I like the idea, but what about how this is exposed to
userspace? The criticism of sysfs for statistics is that it is too slow
to open/read/close lots of files and tough to get "at this moment in
time these are all the different values" snapshots easily. How will
this be addressed here?
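
The cost described above is easy to see from the userspace side. The sketch below assumes that each value is exposed as one file containing an ASCII decimal number, which the proposal does not actually specify, and `demo_read_stat()` and `demo_read_stats()` are hypothetical helpers. Every value costs a separate open/read/close cycle, and nothing ties the values read in one pass to a single point in time.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Read one value: one open, one read, one close. Returns 0 on success. */
int demo_read_stat(const char *path, uint64_t *out)
{
	FILE *f;
	int ok;

	assert(path && out);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%" SCNu64, out) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Read n values one file at a time. The values may change between
 * iterations, so the result is not a consistent snapshot.
 * Returns the number of values read successfully.
 */
size_t demo_read_stats(const char *const *paths, size_t n, uint64_t *out)
{
	size_t done = 0;

	for (size_t i = 0; i < n; i++)
		if (demo_read_stat(paths[i], &out[i]) == 0)
			done++;
	return done;
}
```

A tool sampling a few hundred values per VM would repeat this loop on every refresh, which is the overhead a binary or single-file interface would try to avoid.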
thanks,
greg k-h