Re: "statsfs" API design - Greg Kroah-Hartman

From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: KVM list <kvm@vger.kernel.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Peter Feiner <pfeiner@google.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: "statsfs" API design
Date: Sat, 9 Nov 2019 16:49:52 +0100
Message-ID: <20191109154952.GA1365674@kroah.com>
In-Reply-To: <5d6cdcb1-d8ad-7ae6-7351-3544e2fa366d@redhat.com>

On Wed, Nov 06, 2019 at 04:56:25PM +0100, Paolo Bonzini wrote:
> Hi all,
> 
> statsfs is a proposal for a new Linux kernel synthetic filesystem, to be
> mounted in /sys/kernel/stats, which exposes subsystem-level statistics
> in sysfs.  Reading need not be particularly lightweight, but writing
> must be fast.  Therefore, statistics are gathered at a fine-grain level
> in order to avoid locking or atomic operations, and then aggregated by
> statsfs up to the desired granularity.

Wait, reading a statistic from userspace can be slow, but writing to it
from userspace has to be fast?  Or do you mean the speed is all for
reading/writing the value within the kernel?

> The first user of statsfs would be KVM, which is currently exposing its
> stats in debugfs.  However, debugfs access is now limited by the
> security lock down patches, and in addition statsfs aims to be a
> more-or-less stable API, hence the idea of making it a separate
> filesystem and mount point.

Nice, I've had people ask about something like this for a while now.
For the most part they just dump stuff in sysfs instead (see the DRM
patches recently for people attempting to do that for debugfs values as
well.)

> A few people have already expressed interest in this.  Christian
> Borntraeger presented on the kvm_stat tool recently at KVM Forum and was
> also thinking about using some high-level API in debugfs.  Google has
> KVM patches to gather statistics in a binary format; it may be useful to
> add this kind of functionality (and some kind of introspection similar
> to what tracing does) to statsfs too in the future, but this is
> independent from the kernel API.  I'm also CCing Alex Williamson, in
> case VFIO is interested in something similar, and Steven Rostedt because
> apparently he has enough free time to write poetry in addition to code.
> 
> There are just two concepts in statsfs, namely "values" (aka files) and
> "sources" (directories).
> 
> A value represents a single quantity that is gathered by the statsfs
> client.  It could be the number of vmexits of a given kind, the amount
> of memory used by some data structure, the length of the longest hash
> table chain, or anything like that.
> 
> Values are described by a struct like this one:
> 
> 	struct statsfs_value {
> 		const char *name;
> 		enum stat_type type;	/* STAT_TYPE_{BOOL,U64,...} */
> 		u16 aggr_kind;		/* Bitmask with zero or more of
> 					 * STAT_AGGR_{MIN,MAX,SUM,...}
> 					 */
> 		u16 mode;		/* File mode */
> 		int offset;		/* Offset from base address
> 					 * to field containing the value
> 					 */
> 	};
> 
> As you can see, values are basically integers stored somewhere in a
> struct.   The statsfs_value struct also includes information on which
> operations (for example sum, min, max, average, count nonzero) it makes
> sense to expose when the values are aggregated.
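
A minimal userspace sketch of the base-plus-offset idea described above.
Everything prefixed with demo_ is hypothetical, the enum is trimmed to two
types, and u16 is spelled uint16_t so that it compiles outside the kernel.
It only illustrates how a reader could fetch a field from the struct the
client keeps updating; it is not the proposed implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the proposed types (assumption: only two kinds). */
enum stat_type { STAT_TYPE_BOOL, STAT_TYPE_U64 };

struct statsfs_value {
	const char *name;
	enum stat_type type;
	uint16_t aggr_kind;	/* unused in this sketch */
	uint16_t mode;		/* file mode */
	int offset;		/* offset from the base address */
};

/* Hypothetical stats struct that a client updates in its fast path. */
struct demo_vcpu_stats {
	uint64_t exits;
	bool halted;
};

static const struct statsfs_value demo_values[] = {
	{ "exits",  STAT_TYPE_U64,  0, 0444, offsetof(struct demo_vcpu_stats, exits) },
	{ "halted", STAT_TYPE_BOOL, 0, 0444, offsetof(struct demo_vcpu_stats, halted) },
};

/* Read one value given the base pointer supplied at registration time. */
static uint64_t demo_read_value(const struct statsfs_value *val, const void *base)
{
	const char *field = (const char *)base + val->offset;

	switch (val->type) {
	case STAT_TYPE_BOOL:
		return *(const bool *)field;
	case STAT_TYPE_U64:
		return *(const uint64_t *)field;
	}
	assert(!"unknown stat type");
	return 0;
}
```

In this reading, the client never calls into statsfs to update a value: it
increments its own field (for example demo_vcpu_stats.exits) and the
filesystem side dereferences base + offset only when the file is read.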

What can userspace do with that info?

> Sources form the bulk of the statsfs API.  They can include two kinds of
> elements:
> 
> - values as described above.  The common case is to have many values
> with the same base address, which are represented by an array of struct
> statsfs_value
> 
> - subordinate sources
> 
> Adding a subordinate source has two effects:
> 
> - it creates a subdirectory for each subordinate source
> 
> - for each value in the subordinate sources which has aggr_kind != 0,
> corresponding values will be created in the parent directory too.  If
> multiple subordinate sources are backed by the same array of struct
> statsfs_value, values from all those sources will be aggregated.  That
> is, statsfs will compute these from the values of all items in the list
> and show them in the parent directory.
> 
> Writable values can only be written with a value of zero. Writing zero
> to an aggregate zeroes all the corresponding values in the subordinate
> sources.
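
The aggregation and clear-on-write rules above could be sketched in
userspace C roughly as follows. The STAT_AGGR_* numeric values, the
demo_ names, and the restriction to u64 fields are assumptions made for
this example; the proposal only names the flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values; the proposal does not assign numbers. */
enum {
	STAT_AGGR_SUM = 1 << 0,
	STAT_AGGR_MIN = 1 << 1,
	STAT_AGGR_MAX = 1 << 2,
};

/* Hypothetical per-source stats struct. */
struct demo_stats {
	uint64_t exits;
};

static uint64_t *demo_field(void *base, int offset)
{
	return (uint64_t *)((char *)base + offset);
}

/*
 * Combine the same field from n subordinate sources.  Exactly one
 * aggregation kind is handled per call; n must be at least 1.
 */
static uint64_t demo_aggregate(void *const *bases, size_t n, int offset,
			       unsigned int kind)
{
	uint64_t acc;
	size_t i;

	assert(n > 0);
	acc = *demo_field(bases[0], offset);
	for (i = 1; i < n; i++) {
		uint64_t v = *demo_field(bases[i], offset);

		switch (kind) {
		case STAT_AGGR_SUM:
			acc += v;
			break;
		case STAT_AGGR_MIN:
			if (v < acc)
				acc = v;
			break;
		case STAT_AGGR_MAX:
			if (v > acc)
				acc = v;
			break;
		default:
			assert(!"unsupported aggregation kind");
		}
	}
	return acc;
}

/* Writing zero to an aggregate zeroes the field in every subordinate. */
static void demo_clear(void *const *bases, size_t n, int offset)
{
	size_t i;

	for (i = 0; i < n; i++)
		*demo_field(bases[i], offset) = 0;
}
```

The cost of walking every subordinate source lands on the reader, which
matches the stated goal that updates stay cheap while reads may be slow.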
> 
> Sources are manipulated with these four functions:
> 
> 	struct statsfs_source *statsfs_source_create(const char *fmt,
> 						     ...);
> 	void statsfs_source_add_values(struct statsfs_source *source,
> 				       struct statsfs_value *stat,
> 				       int n, void *ptr);
> 	void statsfs_source_add_subordinate(
> 					struct statsfs_source *source,
> 					struct statsfs_source *sub);
> 	void statsfs_source_remove_subordinate(
> 					struct statsfs_source *source,
> 					struct statsfs_source *sub);
> 
> Sources are reference counted, and for this reason there is also a pair
> of functions in the usual style:
> 
> 	void statsfs_source_get(struct statsfs_source *);
> 	void statsfs_source_put(struct statsfs_source *);
> 
> Finally,
> 
> 	void statsfs_source_register(struct statsfs_source *source);
> 
> lets you create a toplevel statsfs directory.
> 
> As a practical example, KVM's usage of debugfs could be replaced by
> something like this:
> 
> /* Globals */
> 	struct statsfs_value vcpu_stats[] = ...;
> 	struct statsfs_value vm_stats[] = ...;
> 	static struct statsfs_source *kvm_source;
> 
> /* On module creation */
> 	kvm_source = statsfs_source_create("kvm");
> 	statsfs_source_register(kvm_source);
> 
> /* On VM creation */
> 	kvm->src = statsfs_source_create("%d-%d",
> 				         task_pid_nr(current), fd);
> 	statsfs_source_add_values(kvm->src, vm_stats,
> 				  ARRAY_SIZE(vm_stats),
> 				  &kvm->stats);
> 	statsfs_source_add_subordinate(kvm_source, kvm->src);
> 
> /* On vCPU creation */
> 	vcpu_src = statsfs_source_create("vcpu%d", vcpu->vcpu_id);
> 	statsfs_source_add_values(vcpu_src, vcpu_stats,
> 				  ARRAY_SIZE(vcpu_stats),
> 				  &vcpu->stats);
> 	statsfs_source_add_subordinate(kvm->src, vcpu_src);
> 	/*
> 	 * No need to keep the vcpu_src around since there's no
> 	 * separate vCPU deletion event; rely on refcount
> 	 * exclusively.
> 	 */
> 	statsfs_source_put(vcpu_src);
> 
> /* On VM deletion */
> 	statsfs_source_remove_subordinate(kvm_source, kvm->src);
> 	statsfs_source_put(kvm->src);
> 
> /* On KVM exit */
> 	statsfs_source_put(kvm_source);
> 
> How does this look?

Where do the actual values that get reflected in the filesystem get
changed?

I have some old notes somewhere about what people really want when it
comes to a good "statistics" datatype, that I was thinking of building
off of, but that seems independent of what you are doing here, right?
This is just exporting existing values to userspace in a semi-sane way?

Anyway, I like the idea, but what about how this is exposed to
userspace?  The criticism of sysfs for statistics is that it is too slow
to open/read/close lots of files and tough to get "at this moment in
time these are all the different values" snapshots easily.  How will
this be addressed here?
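
To make the cost concrete, here is a minimal sketch of what a userspace
collector has to do with a one-value-per-file layout.  It assumes each
file holds a single decimal integer, which the proposal does not specify,
and the demo_ names are made up for the example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Read one statistic: three syscalls (open, read, close) per value. */
static int demo_read_stat_file(const char *path, uint64_t *out)
{
	char buf[32];
	char *end;
	ssize_t len;
	int fd;

	assert(path && out);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	len = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	if (len <= 0)
		return -1;
	buf[len] = '\0';

	errno = 0;
	*out = strtoull(buf, &end, 10);
	if (errno || end == buf)
		return -1;
	return 0;
}

/*
 * Read n statistics one after another and return how many succeeded.
 * The values are sampled at different times, so the result is not a
 * consistent snapshot.
 */
static size_t demo_read_all(const char *const *paths, size_t n, uint64_t *out)
{
	size_t i, ok = 0;

	for (i = 0; i < n; i++)
		if (demo_read_stat_file(paths[i], &out[i]) == 0)
			ok++;
	return ok;
}
```

With N statistics this is 3*N syscalls per sample, and nothing ties the
first value read to the last one.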

thanks,

greg k-h
