Re: "statsfs" API design - Greg Kroah-Hartman

From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: KVM list <kvm@vger.kernel.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Peter Feiner <pfeiner@google.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: "statsfs" API design
Date: Sat, 9 Nov 2019 16:49:52 +0100
Message-ID: <20191109154952.GA1365674@kroah.com>
In-Reply-To: <5d6cdcb1-d8ad-7ae6-7351-3544e2fa366d@redhat.com>

On Wed, Nov 06, 2019 at 04:56:25PM +0100, Paolo Bonzini wrote:
> Hi all,
> 
> statsfs is a proposal for a new Linux kernel synthetic filesystem, to be
> mounted in /sys/kernel/stats, which exposes subsystem-level statistics
> in sysfs.  Reading need not be particularly lightweight, but writing
> must be fast.  Therefore, statistics are gathered at a fine-grain level
> in order to avoid locking or atomic operations, and then aggregated by
> statsfs until the desired granularity.

Wait, reading a statistic from userspace can be slow, but writing to it
from userspace has to be fast?  Or do you mean the speed is all for
reading/writing the value within the kernel?
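
If it is the latter, I assume the usual per-CPU counter pattern is what
you have in mind.  A rough userspace sketch of that idea follows;
NR_SLOTS, count_exit() and read_exits() are names made up for
illustration, not anything from the proposal:

```c
#include <stdint.h>

#define NR_SLOTS 4	/* stand-in for the number of CPUs; an assumption */

/* One counter per slot, padded so slots do not share a cache line. */
struct slot_counter {
	uint64_t exits;
	char pad[64 - sizeof(uint64_t)];
};

static struct slot_counter counters[NR_SLOTS];

/* Hot path: plain increment of the caller's own slot, no lock, no atomic. */
static inline void count_exit(unsigned int slot)
{
	counters[slot].exits++;
}

/* Slow path: walk every slot and add them up when somebody reads. */
static uint64_t read_exits(void)
{
	uint64_t sum = 0;
	unsigned int i;

	for (i = 0; i < NR_SLOTS; i++)
		sum += counters[i].exits;
	return sum;
}
```

The increment never touches shared state, so the cost moves entirely to
the reader, which may see a slightly stale total.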

> The first user of statsfs would be KVM, which is currently exposing its
> stats in debugfs.  However, debugfs access is now limited by the
> security lock down patches, and in addition statsfs aims to be a
> more-or-less stable API, hence the idea of making it a separate
> filesystem and mount point.

Nice, I've had people ask about something like this for a while now.
For the most part they just dump stuff in sysfs instead (see the recent
DRM patches, where people attempted to do that for debugfs values as
well).

> A few people have already expressed interest in this.  Christian
> Borntraeger presented on the kvm_stat tool recently at KVM Forum and was
> also thinking about using some high-level API in debugfs.  Google has
> KVM patches to gather statistics in a binary format; it may be useful to
> add this kind of functionality (and some kind of introspection similar
> to what tracing does) to statsfs too in the future, but this is
> independent from the kernel API.  I'm also CCing Alex Williamson, in
> case VFIO is interested in something similar, and Steven Rostedt because
> apparently he has enough free time to write poetry in addition to code.
> 
> There are just two concepts in statsfs, namely "values" (aka files) and
> "sources" (directories).
> 
> A value represents a single quantity that is gathered by the statsfs
> client.  It could be the number of vmexits of a given kind, the amount
> of memory used by some data structure, the length of the longest hash
> table chain, or anything like that.
> 
> Values are described by a struct like this one:
> 
> 	struct statsfs_value {
> 		const char *name;
> 		enum stat_type type;	/* STAT_TYPE_{BOOL,U64,...} */
> 		u16 aggr_kind;		/* Bitmask with zero or more of
> 					 * STAT_AGGR_{MIN,MAX,SUM,...}
> 					 */
> 		u16 mode;		/* File mode */
> 		int offset;		/* Offset from base address
> 					 * to field containing the value
> 					 */
> 	};
> 
> As you can see, values are basically integers stored somewhere in a
> struct.   The statsfs_value struct also includes information on which
> operations (for example sum, min, max, average, count nonzero) it makes
> sense to expose when the values are aggregated.

What can userspace do with that info?

> Sources form the bulk of the statsfs API.  They can include two kinds of
> elements:
> 
> - values as described above.  The common case is to have many values
> with the same base address, which are represented by an array of struct
> statsfs_value
> 
> - subordinate sources
> 
> Adding a subordinate source has two effects:
> 
> - it creates a subdirectory for each subordinate source
> 
> - for each value in the subordinate sources which has aggr_kind != 0,
> corresponding values will be created in the parent directory too.  If
> multiple subordinate sources are backed by the same array of struct
> statsfs_value, values from all those sources will be aggregated.  That
> is, statsfs will compute these from the values of all items in the list
> and show them in the parent directory.
> 
> Writable values can only be written with a value of zero. Writing zero
> to an aggregate zeroes all the corresponding values in the subordinate
> sources.
> 
> Sources are manipulated with these four functions:
> 
> 	struct statsfs_source *statsfs_source_create(const char *fmt,
> 						     ...);
> 	void statsfs_source_add_values(struct statsfs_source *source,
> 				       struct statsfs_value *stat,
> 				       int n, void *ptr);
> 	void statsfs_source_add_subordinate(
> 					struct statsfs_source *source,
> 					struct statsfs_source *sub);
> 	void statsfs_source_remove_subordinate(
> 					struct statsfs_source *source,
> 					struct statsfs_source *sub);
> 
> Sources are reference counted, and for this reason there is also a pair
> of functions in the usual style:
> 
> 	void statsfs_source_get(struct statsfs_source *);
> 	void statsfs_source_put(struct statsfs_source *);
> 
> Finally,
> 
> 	void statsfs_source_register(struct statsfs_source *source);
> 
> lets you create a toplevel statsfs directory.
> 
> As a practical example, KVM's usage of debugfs could be replaced by
> something like this:
> 
> /* Globals */
> 	struct statsfs_value vcpu_stats[] = ...;
> 	struct statsfs_value vm_stats[] = ...;
> 	static struct statsfs_source *kvm_source;
> 
> /* On module creation */
> 	kvm_source = statsfs_source_create("kvm");
> 	statsfs_source_register(kvm_source);
> 
> /* On VM creation */
> 	kvm->src = statsfs_source_create("%d-%d\n",
> 				         task_pid_nr(current), fd);
> 	statsfs_source_add_values(kvm->src, vm_stats,
> 				  ARRAY_SIZE(vm_stats),
> 				  &kvm->stats);
> 	statsfs_source_add_subordinate(kvm_source, kvm->src);
> 
> /* On vCPU creation */
> 	vcpu_src = statsfs_source_create("vcpu%d\n", vcpu->vcpu_id);
> 	statsfs_source_add_values(vcpu_src, vcpu_stats,
> 				  ARRAY_SIZE(vcpu_stats),
> 				  &vcpu->stats);
> 	statsfs_source_add_subordinate(kvm->src, vcpu_src);
> 	/*
> 	 * No need to keep the vcpu_src around since there's no
> 	 * separate vCPU deletion event; rely on refcount
> 	 * exclusively.
> 	 */
> 	statsfs_source_put(vcpu_src);
> 
> /* On VM deletion */
> 	statsfs_source_remove_subordinate(kvm_source, kvm->src);
> 	statsfs_source_put(kvm->src);
> 
> /* On KVM exit */
> 	statsfs_source_put(kvm_source);
> 
> How does this look?

Where do the actual values that are reflected in the filesystem get
changed?
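
My reading of the proposal is that the client keeps bumping plain fields
in its own struct and statsfs only reads them through the offset when a
file is opened.  A minimal sketch of that reading, with a cut-down
descriptor; struct stat_value, struct vcpu_stats, stat_read() and
stat_aggregate() are hypothetical names, and only u64 values with SUM
and MAX are handled:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical aggregation flags, modelled on the quoted STAT_AGGR_* names. */
#define STAT_AGGR_SUM	(1u << 0)
#define STAT_AGGR_MAX	(1u << 1)

/* Cut-down descriptor: only u64 values, no type or mode field. */
struct stat_value {
	const char *name;
	uint16_t aggr_kind;
	size_t offset;		/* offset of the field from the base address */
};

/* Client-side stats: ordinary fields the subsystem bumps directly. */
struct vcpu_stats {
	uint64_t exits;
	uint64_t halt_polls;
};

static const struct stat_value vcpu_values[] = {
	{ "exits", STAT_AGGR_SUM | STAT_AGGR_MAX,
	  offsetof(struct vcpu_stats, exits) },
	{ "halt_polls", STAT_AGGR_SUM,
	  offsetof(struct vcpu_stats, halt_polls) },
};

/* Read side: fetch one value through its descriptor. */
static uint64_t stat_read(const struct stat_value *v, const void *base)
{
	return *(const uint64_t *)((const char *)base + v->offset);
}

/* Aggregate the same value over several bases (e.g. all vCPUs of a VM). */
static uint64_t stat_aggregate(const struct stat_value *v, uint16_t kind,
			       const void *const *bases, size_t n)
{
	uint64_t acc = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t val = stat_read(v, bases[i]);

		if (kind == STAT_AGGR_SUM)
			acc += val;
		else if (kind == STAT_AGGR_MAX && val > acc)
			acc = val;
	}
	return acc;
}
```

If that is right, the update path is just "stats->exits++" in the
client, and the descriptor array is the only thing statsfs needs to
know about.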

I have some old notes somewhere about what people really want when it
comes to a good "statistics" datatype, that I was thinking of building
off of, but that seems independent of what you are doing here, right?
This is just exporting existing values to userspace in a semi-sane way?

Anyway, I like the idea, but what about how this is exposed to
userspace?  The criticism of sysfs for statistics is that it is too slow
to open/read/close lots of files, and that it is tough to get an "at
this moment in time, these are all the different values" snapshot.  How
will this be addressed here?
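
To make the concern concrete, this is roughly what a userspace consumer
of one-value-per-file statistics ends up doing.  It is only a sketch
using plain POSIX calls; read_stat_at() and sum_stats_dir() are made-up
helper names and the directory layout is an assumption:

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Read one "single value per file" entry relative to dirfd. */
static int read_stat_at(int dirfd, const char *name, unsigned long long *out)
{
	char buf[32];
	ssize_t len;
	int fd = openat(dirfd, name, O_RDONLY);

	if (fd < 0)
		return -1;
	len = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	if (len <= 0)
		return -1;
	buf[len] = '\0';
	*out = strtoull(buf, NULL, 10);
	return 0;
}

/*
 * Sum every readable entry in a directory.  Each value costs an
 * open/read/close, and the values are sampled at slightly different
 * times, so the result is not a consistent snapshot.
 */
static int sum_stats_dir(const char *path, unsigned long long *sum)
{
	DIR *dir = opendir(path);
	struct dirent *de;
	unsigned long long val;

	if (!dir)
		return -1;
	*sum = 0;
	while ((de = readdir(dir)) != NULL) {
		if (de->d_name[0] == '.')
			continue;	/* skip ".", ".." and hidden entries */
		if (read_stat_at(dirfd(dir), de->d_name, &val) == 0)
			*sum += val;
	}
	closedir(dir);
	return 0;
}
```

Every value costs three system calls, and nothing stops a counter from
changing between two reads.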

thanks,

greg k-h

