Re: rbd top - John Spray

From: John Spray <john.spray@redhat.com>
To: Sage Weil <sage@newdream.net>, Gregory Farnum <greg@gregs42.com>
Cc: Robert LeBlanc <robert@leblancnet.us>,
	ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: rbd top
Date: Mon, 15 Jun 2015 16:03:40 +0100
Message-ID: <557EE94C.10108@redhat.com>
In-Reply-To: <alpine.DEB.2.00.1506150647030.18741@cobra.newdream.net>

On 15/06/2015 14:52, Sage Weil wrote:
>
> I seem to remember having a short conversation about something like this a
> few CDS's back... although I think it was 'rados top'.  IIRC the basic
> idea we had was for each OSD to track its top clients (using some
> approximate LRU type algorithm) and then either feed this relatively small
> amount of info (say, top 10-100 clients) back to the mon for summation,
> or dump via the admin socket for calamari to aggregate.
>
> This doesn't give you the rbd image name, but I bet we could infer that
> without too much trouble (e.g., include a recent object or two with the
> client).  Or, just assume that client id is enough (it'll include an IP
> and PID... enough info to find the /var/run/ceph admin socket or the VM
> process).
>
> If we were going to do top clients, I think it'd make sense to also have a
> top objects list as well, so you can see what the hottest objects in the
> cluster are.

The following is a bit of a tangent...

A few weeks ago I was thinking about general solutions to this problem
(for the filesystem).  I played (very briefly, on wip-live-query) with
the idea of publishing a list of queries to the MDSs/OSDs that would
allow runtime configuration of what kind of thing we're interested in
and how we want it broken down.

If we think of it as an SQL-like syntax, then for the RBD case we would 
have something like:
   SELECT read_bytes, write_bytes WHERE pool=rbd GROUP BY rbd_image

(You'd need a protocol-specific module of some kind to define what
"rbd_image" means here; it would do a simple mapping from object
attributes to an identifier.  A similar module would exist for e.g.
cephfs inodes.)
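
As a rough illustration of such a mapping (not code from
wip-live-query), here is a minimal Python sketch.  It assumes data
objects are named "rbd_data.<image id>.<object number in hex>"; the
function name rbd_image_key is hypothetical, and the key it returns is
the internal image id rather than the human-readable image name.

```python
import re
from typing import Optional

# Assumption: data objects look like "rbd_data.<image id>.<hex object number>".
_RBD_DATA = re.compile(r"^rbd_data\.([^.]+)\.[0-9a-f]+$")


def rbd_image_key(object_name: str) -> Optional[str]:
    """Hypothetical GROUP BY mapper: object name -> image id, or None."""
    match = _RBD_DATA.match(object_name)
    return match.group(1) if match else None
```

Objects that do not match (e.g. image headers or non-RBD objects) map
to None, so a query grouping by this key would simply skip them.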

Each time an OSD does an operation, it consults the list of active
"performance queries" and updates counters according to the value of the
GROUP BY parameter for the query (so in the above example each OSD would
be keeping a result row for each rbd image touched).
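
A minimal Python sketch of that per-operation bookkeeping, under the
assumption that a query is just a WHERE predicate plus a GROUP BY key
function.  Op, PerfQuery and handle_op are made-up names for
illustration, and the group_by lambda is a placeholder rather than a
real object-to-image mapping.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Op:
    """One I/O operation as seen by an OSD (hypothetical, simplified)."""
    pool: str
    object_name: str
    read_bytes: int = 0
    write_bytes: int = 0


def _new_rows() -> Dict[str, Dict[str, int]]:
    # One result row per GROUP BY key, created on first touch.
    return defaultdict(lambda: {"read_bytes": 0, "write_bytes": 0})


@dataclass
class PerfQuery:
    where: Callable[[Op], bool]
    group_by: Callable[[Op], Optional[str]]
    rows: Dict[str, Dict[str, int]] = field(default_factory=_new_rows)

    def observe(self, op: Op) -> None:
        """Update this query's counters if the op matches its WHERE clause."""
        if not self.where(op):
            return
        key = self.group_by(op)
        if key is None:
            return
        row = self.rows[key]
        row["read_bytes"] += op.read_bytes
        row["write_bytes"] += op.write_bytes


def handle_op(active_queries: List[PerfQuery], op: Op) -> None:
    """What an OSD would do on each operation: consult every active query."""
    for query in active_queries:
        query.observe(op)


# SELECT read_bytes, write_bytes WHERE pool=rbd GROUP BY rbd_image
by_image = PerfQuery(
    where=lambda op: op.pool == "rbd",
    group_by=lambda op: op.object_name.split(".")[0],  # placeholder mapping
)
sample_ops = [
    Op("rbd", "img1.0", write_bytes=4096),
    Op("rbd", "img1.1", read_bytes=512),
    Op("rbd", "img2.0", read_bytes=100),
    Op("cephfs_data", "100.0", read_bytes=999),  # filtered out by WHERE
]
for op in sample_ops:
    handle_op([by_image], op)
```

The cost per operation is one predicate and one dictionary update per
active query, so it scales with the number of queries published, not
with the number of clients or objects in the cluster.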

The LRU part could be implemented as SORT BY + LIMIT parameters, such
that the result rows would be periodically sorted and the least-touched
results would drop off the list.  That would probably be used in
conjunction with a decay operator on the sorted-by field, like:
   SELECT read_bytes, write_bytes, ops WHERE pool=rbd GROUP BY rbd_image
      SORT BY movingAverage(derivative(ops)) LIMIT 100
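
One way to read the sort-and-limit step, sketched in Python.  This is
an assumption about how movingAverage(derivative(ops)) could behave,
not an existing implementation: the derivative is taken as the change
in the op count since the previous periodic tick, and the moving
average as an exponentially weighted one.  TopN, Row and alpha are
hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Row:
    ops: int = 0        # cumulative op count for this key
    last_ops: int = 0   # op count at the previous tick
    rate: float = 0.0   # moving average of ops per tick


class TopN:
    def __init__(self, limit: int, alpha: float = 0.5) -> None:
        self.limit = limit
        self.alpha = alpha  # weight given to the most recent tick
        self.rows: Dict[str, Row] = {}

    def record(self, key: str, ops: int = 1) -> None:
        self.rows.setdefault(key, Row()).ops += ops

    def tick(self) -> List[str]:
        """Periodic pass: update rates, sort, and drop rows beyond the limit."""
        for row in self.rows.values():
            delta = row.ops - row.last_ops  # derivative(ops) over one tick
            row.last_ops = row.ops
            row.rate = self.alpha * delta + (1 - self.alpha) * row.rate
        ranked = sorted(self.rows.items(), key=lambda kv: kv[1].rate, reverse=True)
        self.rows = dict(ranked[: self.limit])
        return [key for key, _ in ranked[: self.limit]]
```

A row that falls off the list loses its history and starts from zero
if the key shows up again, so the result is approximate, in the same
spirit as the approximate LRU mentioned earlier in the thread.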

Combining WHERE clauses would let the user "drill down" (apologies for
buzzword) by doing things like identifying the busiest clients, and
then for each of those clients identifying which images/files/objects
the client is most active on, or vice versa: identifying busy objects
and then seeing which clients are hitting them.  Usually keeping around
enough stats to enable this is prohibitive at scale, since it means
N_clients*N_objects counters.  It's fine, though, when you're actively
creating custom queries for just the results you're really interested
in, and when you have the LIMIT part to ensure results never get
oversized.
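
The drill-down can be sketched as two queries run one after the other.
This Python fragment is only an illustration over an in-memory list of
operations; run_query and the Op fields are hypothetical, and a real
system would evaluate the queries on the OSDs rather than over a log.

```python
from collections import Counter
from typing import Callable, Iterable, List, NamedTuple, Tuple


class Op(NamedTuple):
    client: str
    obj: str
    nbytes: int


def run_query(
    ops: Iterable[Op],
    where: Callable[[Op], bool],
    group_by: Callable[[Op], str],
    limit: int,
) -> List[Tuple[str, int]]:
    """Sum bytes per GROUP BY key for ops matching WHERE; keep the top rows."""
    totals: Counter = Counter()
    for op in ops:
        if where(op):
            totals[group_by(op)] += op.nbytes
    return totals.most_common(limit)


ops = [
    Op("client.1", "img_a", 100),
    Op("client.2", "img_a", 10),
    Op("client.1", "img_b", 300),
    Op("client.2", "img_c", 50),
    Op("client.1", "img_a", 50),
]

# Step 1: which client is busiest?
busiest_client = run_query(ops, lambda op: True, lambda op: op.client, limit=1)[0][0]

# Step 2: add a WHERE clause on that client and regroup by object.
hot_objects = run_query(
    ops, lambda op: op.client == busiest_client, lambda op: op.obj, limit=2
)
```

Neither step ever holds more than one row per key of its own GROUP BY,
which is the point of issuing targeted queries instead of tracking
every client/object pair.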

The GROUP BY options would also include metadata sent from clients, e.g.
the obvious cases like VM instance names, or rack IDs, or HPC job IDs.
Maybe also some less obvious ones like decorating cephfs IOs with the
inode of the directory containing the file, so that OSDs could
accumulate per-directory bandwidth numbers, and a user could ask "which
directory is bandwidth-hottest?" as well as "which file is
bandwidth-hottest?".

Then, after implementing all that craziness, you get some kind of wild 
multicolored GUI that shows you where the action is in your system at a 
cephfs/rgw/rbd level.

Cheers,
John

Thread overview: 11+ messages
2015-06-11 19:33 rbd top Robert LeBlanc
2015-06-15 11:52 ` Gregory Farnum
2015-06-15 13:52   ` Sage Weil
2015-06-15 15:03     ` John Spray [this message]
2015-06-15 16:10       ` Robert LeBlanc
2015-06-15 16:52         ` John Spray
2015-06-16 11:05           ` Wido den Hollander
2015-06-17 17:06             ` Robert LeBlanc
2015-06-17 17:59               ` John Spray
2015-06-16 10:04       ` Gregory Farnum
2015-06-15 16:28     ` Robert LeBlanc
