Suggestions on tracker 13578

* Suggestions on tracker 13578
@ 2015-12-01 13:23 Vimal
  2015-12-02 18:23 ` Gregory Farnum
  0 siblings, 1 reply; 5+ messages in thread
From: Vimal @ 2015-12-01 13:23 UTC (permalink / raw)
  To: ceph-devel

Hello,

This mail is to discuss the feature request at 
http://tracker.ceph.com/issues/13578.

If implemented, such a tool would help point out misconfigurations
that may cause problems in a cluster later.

Some of the suggestions are:

a) A check for whether MON and OSD daemons run on the same machines.

b) If /var is a separate partition or not, to prevent the root 
filesystem from being filled up.

c) If monitors are deployed in different failure domains or not.

d) If the OSDs are deployed in different failure domains.

e) If a journal disk is used for more than six OSDs. Right now, the
documentation suggests placing up to six OSD journals on a single
journal disk.

f) If the failure domains take the power source into account.

There can be several more checks, and it can be a useful tool to check
an existing cluster or a new installation for problems.
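
As a rough illustration of checks (a), (b), and (e), here is a minimal
Python sketch. It assumes a hypothetical inventory (daemon-to-host and
OSD-to-journal mappings) supplied by the caller; the function names and
data layout are made up for this example and are not part of any Ceph
tool. The limit of six journals per disk is the documentation guidance
mentioned above.

```python
import os
from collections import defaultdict


def colocated_hosts(daemon_hosts):
    """Check (a): return hosts that run both a MON and an OSD.

    `daemon_hosts` is a hypothetical mapping such as
    {"mon.a": "node1", "osd.0": "node1"}.
    """
    mon_hosts = {h for d, h in daemon_hosts.items() if d.startswith("mon.")}
    osd_hosts = {h for d, h in daemon_hosts.items() if d.startswith("osd.")}
    return sorted(mon_hosts & osd_hosts)


def is_separate_partition(path, parent="/"):
    """Check (b): True if `path` lives on a different device than `parent`.

    Compares device IDs from os.stat(); this is a simple heuristic.
    """
    return os.stat(path).st_dev != os.stat(parent).st_dev


def overloaded_journals(osd_journals, limit=6):
    """Check (e): return journal disks that serve more than `limit` OSDs.

    `osd_journals` is a hypothetical mapping of OSD name to a
    (host, device) pair, e.g. {"osd.0": ("node1", "/dev/sdb")}.
    """
    by_disk = defaultdict(list)
    for osd, disk in osd_journals.items():
        by_disk[disk].append(osd)
    return {disk: sorted(osds) for disk, osds in by_disk.items()
            if len(osds) > limit}
```

On a node, `is_separate_partition("/var")` would cover check (b); the
other two functions only need the inventory, so they could run from any
admin host.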

But I'd like to know how the engineering community sees this, whether it
seems to be worth pursuing, and what suggestions you have for
improving/adding to this.

Thank you,

Vimal

* Re: Suggestions on tracker 13578
  2015-12-01 13:23 Suggestions on tracker 13578 Vimal
@ 2015-12-02 18:23 ` Gregory Farnum
  2015-12-02 19:04   ` Mark Nelson
  0 siblings, 1 reply; 5+ messages in thread
From: Gregory Farnum @ 2015-12-02 18:23 UTC (permalink / raw)
  To: Vimal; +Cc: ceph-devel

On Tue, Dec 1, 2015 at 5:23 AM, Vimal <vikumar@redhat.com> wrote:
> But I'd like to know how the engineering community sees this, whether it
> seems to be worth pursuing, and what suggestions you have for
> improving/adding to this.

This is a user experience and support tool; I don't think the
engineering community can really judge its value. ;)

So sure, sounds good to me. It'll need to get into the hands of users
before we find out if it's a good plan or not. I was at the SDI Summit
yesterday and was hearing about how some of our choices (like
HEALTH_WARN on pg counts) are *really* scary for users who think
they're in danger of losing data. I suspect the difficulty of a tool
like this will be more in the communication of issues and severity
than in what exactly we choose to check.
-Greg

* Re: Suggestions on tracker 13578
  2015-12-02 18:23 ` Gregory Farnum
@ 2015-12-02 19:04   ` Mark Nelson
  2015-12-02 19:54     ` Paul Von-Stamwitz
  0 siblings, 1 reply; 5+ messages in thread
From: Mark Nelson @ 2015-12-02 19:04 UTC (permalink / raw)
  To: Gregory Farnum, Vimal; +Cc: ceph-devel



On 12/02/2015 12:23 PM, Gregory Farnum wrote:
> This is a user experience and support tool; I don't think the
> engineering community can really judge its value. ;)
>
> So sure, sounds good to me. It'll need to get into the hands of users
> before we find out if it's a good plan or not. I was at the SDI Summit
> yesterday and was hearing about how some of our choices (like
> HEALTH_WARN on pg counts) are *really* scary for users who think
> they're in danger of losing data. I suspect the difficulty of a tool
> like this will be more in the communication of issues and severity
> than in what exactly we choose to check.

Frankly I've never been a big fan of how we report warnings like this 
through the health check.  It's important to let users know if they've 
set up things sub-optimally, but I don't think ceph health is the way to 
do it.  It's the difference between your doctor telling you that you
should exercise more and lose a few pounds vs. telling you that you have
Ebola and are going to suffer an incredibly gruesome and painful death
in the next 48 hours. :)

* RE: Suggestions on tracker 13578
  2015-12-02 19:04   ` Mark Nelson
@ 2015-12-02 19:54     ` Paul Von-Stamwitz
  2015-12-02 20:34       ` John Spray
  0 siblings, 1 reply; 5+ messages in thread
From: Paul Von-Stamwitz @ 2015-12-02 19:54 UTC (permalink / raw)
  To: Mark Nelson, Gregory Farnum, Vimal; +Cc: ceph-devel

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Wednesday, December 02, 2015 11:04 AM
> To: Gregory Farnum; Vimal
> Cc: ceph-devel
> Subject: Re: Suggestions on tracker 13578
> 
> 
> Frankly I've never been a big fan of how we report warnings like this through
> the health check.  It's important to let users know if they've set up things
> sub-optimally, but I don't think ceph health is the way to do it.  It's the
> difference between your doctor telling you that you should exercise more and
> lose a few pounds vs. telling you that you have Ebola and are going to suffer
> an incredibly gruesome and painful death in the next 48 hours. :)
> 

Since I was the one at the SDI Summit that took issue with some of these
warnings, I whole-heartedly agree with Greg's and Mark's comments. A
warning at health check should indicate to the user that some corrective
action should be taken, besides turning the warning off :-) I do not
have an issue with reporting advisories, but they should be kept
separate from true warnings. If we want to notify the user of variances
from best practices, I suggest a separate method, e.g. "ceph advise",
rather than constantly repeating them on health checks.

* Re: Suggestions on tracker 13578
  2015-12-02 19:54     ` Paul Von-Stamwitz
@ 2015-12-02 20:34       ` John Spray
  0 siblings, 0 replies; 5+ messages in thread
From: John Spray @ 2015-12-02 20:34 UTC (permalink / raw)
  To: Paul Von-Stamwitz; +Cc: Mark Nelson, Gregory Farnum, Vimal, ceph-devel

On Wed, Dec 2, 2015 at 7:54 PM, Paul Von-Stamwitz
<PVonStamwitz@us.fujitsu.com> wrote:
> Since I was the one at the SDI Summit that took issue with some of these
> warnings, I whole-heartedly agree with Greg's and Mark's comments. A
> warning at health check should indicate to the user that some corrective
> action should be taken, besides turning the warning off :-) I do not
> have an issue with reporting advisories, but they should be kept
> separate from true warnings. If we want to notify the user of variances
> from best practices, I suggest a separate method, e.g. "ceph advise",
> rather than constantly repeating them on health checks.

Separating things into "advise" vs. "health" probably doesn't solve
the problem, because one has to decide what goes in which section, and
ends up with the same problem as INFO/WARN/ERR categorisation -- the
idea of having different categories is fine, the hard part is
assigning particular items to a category in a way that makes sense for
different users.

IMHO the core problems are attempting to collapse all these
notifications into a global indicator, and attempting to do that in
the same way for all systems.  It needs to be finer grained than that.
I never got around to doing anything with #7192 [1], but it outlines a
way to change the health output into a form where it's easier to
selectively ignore particular items.

Once you break down the health output into a set of known status
codes, a natural extension would be to have user-configurable masks,
so that they could cancel particular warnings if they wanted to.
Think of it like having the ability to press the warning lights in an
aeroplane cockpit to turn off the alarm sound.
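
To make the idea concrete, here is a minimal Python sketch of per-item
status codes with a user-configurable mask. It is only an illustration
under assumptions: the `HealthItem` type, the `summarize` function, and
the example codes are hypothetical and are not taken from #7192 or from
any Ceph code. Only the HEALTH_OK/HEALTH_WARN/HEALTH_ERR levels are
existing names.

```python
from dataclasses import dataclass

# Ordering of the overall health levels, least to most severe.
SEVERITY_ORDER = {"HEALTH_OK": 0, "HEALTH_WARN": 1, "HEALTH_ERR": 2}


@dataclass(frozen=True)
class HealthItem:
    code: str       # stable identifier (hypothetical, e.g. "EXAMPLE_PG_COUNT")
    severity: str   # one of the keys in SEVERITY_ORDER
    message: str    # human-readable detail


def summarize(items, muted=frozenset()):
    """Return (overall_status, visible_items), skipping muted codes.

    The overall status is the worst severity among the items the user
    has not masked; with nothing visible it is HEALTH_OK.
    """
    visible = [item for item in items if item.code not in muted]
    overall = max((item.severity for item in visible),
                  key=SEVERITY_ORDER.__getitem__,
                  default="HEALTH_OK")
    return overall, visible
```

With this shape, a user who has decided that a given item is acceptable
on their system adds its code to `muted`, and the global indicator stops
reflecting it while the other items still show up.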

John

1. http://tracker.ceph.com/issues/7192
