* Re: [lm-sensors] FAULT status of sensors
2011-05-16 18:23 [lm-sensors] FAULT status of sensors Jan Wagner
@ 2011-05-17 2:32 ` Guenter Roeck
2011-05-17 7:36 ` Jean Delvare
2011-05-17 9:33 ` Jan Wagner
2 siblings, 0 replies; 4+ messages in thread
From: Guenter Roeck @ 2011-05-17 2:32 UTC (permalink / raw)
To: lm-sensors
On Mon, May 16, 2011 at 02:23:44PM -0400, Jan Wagner wrote:
> Hi there,
>
> we got a bug report[1] against our nagios-plugins package. Unfortunately we are
> unsure about what "FAULT" means.
> If it means a hardware problem, such as a damaged sensor, we
> would report "CRITICAL", as a problem occurred.
> If it means a problem detecting the sensor or some other software-like
> problem, we would report "UNKNOWN", as this does not mean a hardware
> problem happened.
>
It is supposed to indicate a HW problem. Here is the text describing the sysfs attribute:
"Each input channel may have an associated fault file. This can be used
to notify open diodes, unconnected fans etc. where the hardware
supports it. When this boolean has value 1, the measurement for that
channel should not be trusted."
Note that "critical" in the hwmon ABI means that a critical limit has been reached.
You would get a "critical" alarm in this case. You might have a terminology problem
if you use "critical" for a hardware failure.
An undetected sensor should not show up in the first place.
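
For illustration, a minimal Python sketch of reading these fault files.
It assumes the attributes are named <channel>_fault (temp1_fault,
fan2_fault, ...) and sit directly in one hwmon directory. The function
names and the fake directory in demo() are made up for the example:

```python
import tempfile
from pathlib import Path


def read_faults(hwmon_dir):
    """Return {channel: faulted} for every *_fault file in one hwmon directory."""
    faults = {}
    for path in sorted(Path(hwmon_dir).glob("*_fault")):
        channel = path.name[: -len("_fault")]  # "temp1_fault" -> "temp1"
        try:
            # The attribute is a boolean: "1" means the reading is not to be trusted.
            faults[channel] = path.read_text().strip() == "1"
        except OSError:
            # Unreadable attribute: skip it rather than guess a value.
            continue
    return faults


def demo():
    # Fake hwmon directory so the sketch runs without real hardware.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "temp1_fault").write_text("0\n")
        Path(tmp, "fan2_fault").write_text("1\n")
        Path(tmp, "temp1_input").write_text("42000\n")  # not a fault file, ignored
        return read_faults(tmp)


if __name__ == "__main__":
    print(demo())
```

On a real system the directory would typically be something like
/sys/class/hwmon/hwmon0, but the exact location depends on the driver
and kernel version.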
Guenter
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [lm-sensors] FAULT status of sensors
2011-05-16 18:23 [lm-sensors] FAULT status of sensors Jan Wagner
2011-05-17 2:32 ` Guenter Roeck
@ 2011-05-17 7:36 ` Jean Delvare
2011-05-17 9:33 ` Jan Wagner
2 siblings, 0 replies; 4+ messages in thread
From: Jean Delvare @ 2011-05-17 7:36 UTC (permalink / raw)
To: lm-sensors
On Mon, 16 May 2011 19:32:52 -0700, Guenter Roeck wrote:
> On Mon, May 16, 2011 at 02:23:44PM -0400, Jan Wagner wrote:
> > Hi there,
> >
> > > we got a bug report[1] against our nagios-plugins package. Unfortunately we are
> > > unsure about what "FAULT" means.
> > > If it means a hardware problem, such as a damaged sensor, we
> > > would report "CRITICAL", as a problem occurred.
> > > If it means a problem detecting the sensor or some other software-like
> > > problem, we would report "UNKNOWN", as this does not mean a hardware
> > > problem happened.
> >
>
> It is supposed to indicate a HW problem. Here is the text describing the sysfs attribute:
>
> "Each input channel may have an associated fault file. This can be used
> to notify open diodes, unconnected fans etc. where the hardware
> supports it. When this boolean has value 1, the measurement for that
> channel should not be trusted."
>
> Note that "critical" in the hwmon ABI means that a critical limit has been reached.
> You would get a "critical" alarm in this case. You might have a terminology problem
> if you use "critical" for a hardware failure.
>
> An undetected sensor should not show up in the first place.
In fact, FAULT can happen in two different cases. The first case (most
common) is a channel left unused by the manufacturer, and such a channel
should indeed be ignored. The second case is a thermal diode dying or a
fan stalling, and reporting this makes sense. So I would:
* Ignore sensor channels which report FAULT when you start monitoring.
* Report FAULT as an actual problem if it happens later during
  monitoring, for a channel which reported real values before. The
  terminology is up to you.
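
The two rules can be sketched in Python as a small stateful tracker.
This is only one possible reading of the policy: the class name and the
{channel: faulted} input format are assumptions, and a channel counts as
"reported real values before" as soon as it has been seen without FAULT
once:

```python
class FaultTracker:
    """Report FAULT only for channels that delivered real values earlier."""

    def __init__(self):
        # Channels that have reported a non-FAULT reading at least once.
        self.healthy = set()

    def update(self, readings):
        """readings maps channel name -> True if the channel reports FAULT.

        Returns the sorted list of channels whose FAULT should be reported.
        """
        problems = []
        for channel, faulted in readings.items():
            if not faulted:
                self.healthy.add(channel)
            elif channel in self.healthy:
                # Worked before, faulted now: likely a dying diode or stalled fan.
                problems.append(channel)
            # Otherwise the channel has only ever reported FAULT: treat it
            # as unused by the manufacturer and ignore it.
        return sorted(problems)


if __name__ == "__main__":
    tracker = FaultTracker()
    print(tracker.update({"temp1": False, "temp2": True}))  # first poll
    print(tracker.update({"temp1": True, "temp2": True}))   # later poll
```

A plugin that runs once per check would have to persist the set of
healthy channels between runs for this to work.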
--
Jean Delvare
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [lm-sensors] FAULT status of sensors
2011-05-16 18:23 [lm-sensors] FAULT status of sensors Jan Wagner
2011-05-17 2:32 ` Guenter Roeck
2011-05-17 7:36 ` Jean Delvare
@ 2011-05-17 9:33 ` Jan Wagner
2 siblings, 0 replies; 4+ messages in thread
From: Jan Wagner @ 2011-05-17 9:33 UTC (permalink / raw)
To: lm-sensors
Hi Jonathan,
thanks for your bugreport.
On Tuesday 17 May 2011 09:36:00 Jean Delvare wrote:
> On Mon, 16 May 2011 19:32:52 -0700, Guenter Roeck wrote:
> > On Mon, May 16, 2011 at 02:23:44PM -0400, Jan Wagner wrote:
> > > we got a bug report[1] against our nagios-plugins package. Unfortunately
> > > we are unsure about what "FAULT" means.
> > > If it means a hardware problem, such as a damaged sensor, we
> > > would report "CRITICAL", as a problem occurred.
> > > If it means a problem detecting the sensor or some other
> > > software-like problem, we would report "UNKNOWN", as this does not
> > > mean a hardware problem happened.
> >
> > It is supposed to indicate a HW problem. Here is the text describing the
> > sysfs attribute:
> >
> > "Each input channel may have an associated fault file. This can be used
> > to notify open diodes, unconnected fans etc. where the hardware
> > supports it. When this boolean has value 1, the measurement for that
> > channel should not be trusted."
> >
> > Note that "critical" in the hwmon ABI means that a critical limit has
> > been reached. You would get a "critical" alarm in this case. You might
> > have a terminology problem if you use "critical" for a hardware failure.
> >
> > An undetected sensor should not show up in the first place.
>
> In fact, FAULT can happen in two different cases. The first case (most
> common) is a channel left unused by the manufacturer, and such a channel
> should indeed be ignored. The second case is a thermal diode dying or a
> fan stalling, and reporting this makes sense. So I would:
> * Ignore sensor channels which report FAULT when you start monitoring.
> * Report FAULT as an actual problem if it happens later during
>   monitoring, for a channel which reported real values before. The
>   terminology is up to you.
(Jean: many thanks for your clarification)
This means that FAULT can be reported both when the hardware is fine and
when the hardware is failing.
On Saturday 26 February 2011 00:07:22 Jonathan Wiltshire wrote:
> The attached patch causes check_sensors to return a critical status if
> faulty sensors are detected.
For nagios-plugins this means we cannot tell whether there is actually a
problem. We should report "UNKNOWN" via check_sensors if "FAULT" is reported
by the sensor. As the cause may not be a hardware problem at all, something
like --ignore-fault needs to be implemented too.
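
A rough Python sketch of that behaviour, not the actual check_sensors
code: it assumes the plugin scans the text printed by `sensors` for the
words ALARM and FAULT, and models the proposed --ignore-fault option as
an ignore_fault parameter. The status messages and the sample output are
made up; the exit codes are the standard Nagios ones:

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_sensors(output, ignore_fault=False):
    """Map the text output of `sensors` to a (status, message) pair."""
    if re.search(r"\bALARM\b", output):
        # A limit was exceeded: this is a real, known problem.
        return CRITICAL, "SENSOR CRITICAL - sensor alarm detected"
    if re.search(r"\bFAULT\b", output) and not ignore_fault:
        # FAULT may be an unused channel or failing hardware: we cannot tell.
        return UNKNOWN, "SENSOR UNKNOWN - sensor reports FAULT"
    return OK, "SENSOR OK"


if __name__ == "__main__":
    # Illustrative sample only, not captured from real hardware.
    sample = (
        "temp1:        +45.0 C  (high = +80.0 C)\n"
        "temp3:        FAULT  (high = +80.0 C)\n"
    )
    print(check_sensors(sample))
    print(check_sensors(sample, ignore_fault=True))
```

Checking ALARM before FAULT is a design choice of this sketch, so that
an exceeded limit is never hidden by a FAULT on another channel.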
With kind regards, Jan.
--
Never write mail to <waja@spamfalle.info>, you have been warned!
-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GIT d-- s+: a C+++ UL++++ P+ L+++ E--- W+++ N+++ o++ K++ w--- O M V- PS PE Y++
PGP++ t-- 5 X R tv- b+ DI D+ G++ e++ h---- r+++ y++++
------END GEEK CODE BLOCK------
^ permalink raw reply [flat|nested] 4+ messages in thread