All of lore.kernel.org
 help / color / mirror / Atom feed
* [lm-sensors] sensord exits on any error
@ 2008-12-05 18:34 Andy Poling
  2008-12-05 20:27 ` Jean Delvare
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: Andy Poling @ 2008-12-05 18:34 UTC (permalink / raw)
  To: lm-sensors

We are using lm-sensors in an embedded system, and have noticed that when
errors occur, sensord exits.

We occasionally encounter SMBus collisions which cause transient errors on
SMBus reads by the sensor chip driver.

We modified the most recent w83793 driver (which is much improved in dealing
with SMBus issues) to return cached data for up to 30 seconds in the case of
SMBus errors, and then to return EAGAIN on the sysfs file read if the SMBus
errors persist.

However sensord exits when it gets EAGAIN.  We have patched sensord not to
exit on errors, but instead to log them and continue.  This seems to us to be
the robust behavior for an important system monitoring daemon - if the error
is transient it rides it out, and if it is permanent it complains to get
attention.  It's about a 2-line change in the main loop of sensord.

I'd like to push these patches upstream, but wanted to first see if it is
acceptable to change the behavior of the w83793 driver and sensord in this
way.

-Andy

_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [lm-sensors] sensord exits on any error
  2008-12-05 18:34 [lm-sensors] sensord exits on any error Andy Poling
@ 2008-12-05 20:27 ` Jean Delvare
  2008-12-06  0:34 ` Andy Poling
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Jean Delvare @ 2008-12-05 20:27 UTC (permalink / raw)
  To: lm-sensors

Hi Andy,

On Fri, 5 Dec 2008 12:34:47 -0600 (CST), Andy Poling wrote:
> We are using lm-sensors in an embedded system, and have noticed that when
> errors occur, sensord exits.

This is an known issue, tracked as ticket #2330:
http://www.lm-sensors.org/ticket/2330

> We occasionally encounter SMBus collisions which cause transient errors on
> SMBus reads by the sensor chip driver.

Multi-master bus?

> We modified the most recent w83793 driver (which is much improved in dealing
> with SMBus issues) to return cached data for up to 30 seconds in the case of
> SMBus errors, and then to return EAGAIN on the sysfs file read if the SMBus
> errors persist.
> 
> However sensord exits when it gets EAGAIN.  We have patched sensord not to
> exit on errors, but instead to log them and continue.  This seems to us to be
> the robust behavior for an important system monitoring daemon - if the error
> is transient it rides it out, and if it is permanent it complains to get
> attention.  It's about a 2-line change in the main loop of sensord.
> 
> I'd like to push these patches upstream, but wanted to first see if it is
> acceptable to change the behavior of the w83793 driver and sensord in this
> way.

Your changes to the w83793d drivers are IMHO not acceptable. It is up
to user-space to decide what to do when a sensor value can't be read.
Silently caching the values for an arbitrary period of 30 seconds isn't
nice. Returning errors immediately, OTOH would probably be better than
returning 0 as the driver does at the moment. Whether the error value
should be -EAGAIN or -EIO can be discussed. This is however a
non-trivial change due to the 2-second caching strategy that the driver
implements. But you probably already know that if you modified the
driver for your own use already. An easier approach would be to simply
retry on read failures, as I suspect the second read attempt would
almost always succeed.

Your fix to sensord is totally welcome. I could never find the time to
work on ticket #2330, so if you have a working patch I will be very
happy to review and apply it.

Thanks,
-- 
Jean Delvare

_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [lm-sensors] sensord exits on any error
  2008-12-05 18:34 [lm-sensors] sensord exits on any error Andy Poling
  2008-12-05 20:27 ` Jean Delvare
@ 2008-12-06  0:34 ` Andy Poling
  2008-12-06  8:48 ` Jean Delvare
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Andy Poling @ 2008-12-06  0:34 UTC (permalink / raw)
  To: lm-sensors

On Fri, 5 Dec 2008, Jean Delvare wrote:
>> We occasionally encounter SMBus collisions which cause transient errors on
>> SMBus reads by the sensor chip driver.
>
> Multi-master bus?

We theorize that it may be the BIOS or ACPI periodically checking on CPU
temperature or some similar activity.

With older versions of the i2c-i801 driver, it would leave the SMBus wedged
and it was game-over, but the most recent version (which we back-ported)
finally seems to grapple effectively with SMBus collisions so that they're
truly transient.

>> We modified the most recent w83793 driver (which is much improved in dealing
>> with SMBus issues) to return cached data for up to 30 seconds in the case of
>> SMBus errors, and then to return EAGAIN on the sysfs file read if the SMBus
>> errors persist.
>
> Your changes to the w83793d drivers are IMHO not acceptable. It is up
> to user-space to decide what to do when a sensor value can't be read.
> Silently caching the values for an arbitrary period of 30 seconds isn't
> nice. Returning errors immediately, OTOH would probably be better than
> returning 0 as the driver does at the moment. Whether the error value
> should be -EAGAIN or -EIO can be discussed. This is however a
> non-trivial change due to the 2-second caching strategy that the driver
> implements. But you probably already know that if you modified the
> driver for your own use already. An easier approach would be to simply
> retry on read failures, as I suspect the second read attempt would
> almost always succeed.

Yep - agreed on the lengthy caching being problematic - we were looking to
kill it dead on the first whack, and thus overshot the mark.

As you mentioned, the problem with the un-patched w83793 driver is that it
returns bad data and no error on SMBus errors rather than returning an error.

If we want to be consistent with your paradigm of letting user-space decide
how to deal with the problem, I think we should just return EAGAIN (or EIO) on
a failure, and it would then be up to sensord to deal with it.  With our patch
to sensord, it would naturally try again on the next read interval.

It wouldn't be at all difficult to remove the extended cache we added, leaving
just the return of error on failure.

What do you think?


> Your fix to sensord is totally welcome. I could never find the time to
> work on ticket #2330, so if you have a working patch I will be very
> happy to review and apply it.

It looks like I can't add to the ticket.  What's the best way to submit a
patch?  Just send it to this same list?  Is a patch against 2.10.7 OK?

-Andy

_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [lm-sensors] sensord exits on any error
  2008-12-05 18:34 [lm-sensors] sensord exits on any error Andy Poling
  2008-12-05 20:27 ` Jean Delvare
  2008-12-06  0:34 ` Andy Poling
@ 2008-12-06  8:48 ` Jean Delvare
  2008-12-12  8:01 ` Jean Delvare
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Jean Delvare @ 2008-12-06  8:48 UTC (permalink / raw)
  To: lm-sensors

Hi Andy,

On Fri, 5 Dec 2008 18:34:37 -0600 (CST), Andy Poling wrote:
> On Fri, 5 Dec 2008, Jean Delvare wrote:
> > Multi-master bus?
> 
> We theorize that it may be the BIOS or ACPI periodically checking on CPU
> temperature or some similar activity.
>
> With older versions of the i2c-i801 driver, it would leave the SMBus wedged
> and it was game-over, but the most recent version (which we back-ported)
> finally seems to grapple effectively with SMBus collisions so that they're
> truly transient.

You should really figure out. If there is a second master on the bus
then that's OK, I2C/SMBus is designed for this. If OTOH either APCI or
the BIOS is using the Intel 82801 as well, then the risks of
misbehavior, data loss or corruption are high.

Which kernel are you running?

I wrote a patch years ago to detect concurrent accesses to the Intel
82801 SMBus, you might be interested in it.

> > Your changes to the w83793d drivers are IMHO not acceptable. It is up
> > to user-space to decide what to do when a sensor value can't be read.
> > Silently caching the values for an arbitrary period of 30 seconds isn't
> > nice. Returning errors immediately, OTOH would probably be better than
> > returning 0 as the driver does at the moment. Whether the error value
> > should be -EAGAIN or -EIO can be discussed. This is however a
> > non-trivial change due to the 2-second caching strategy that the driver
> > implements. But you probably already know that if you modified the
> > driver for your own use already. An easier approach would be to simply
> > retry on read failures, as I suspect the second read attempt would
> > almost always succeed.
> 
> Yep - agreed on the lengthy caching being problematic - we were looking to
> kill it dead on the first whack, and thus overshot the mark.
> 
> As you mentioned, the problem with the un-patched w83793 driver is that it
> returns bad data and no error on SMBus errors rather than returning an error.
> 
> If we want to be consistent with your paradigm of letting user-space decide
> how to deal with the problem, I think we should just return EAGAIN (or EIO) on
> a failure, and it would then be up to sensord to deal with it.  With our patch
> to sensord, it would naturally try again on the next read interval.
> 
> It wouldn't be at all difficult to remove the extended cache we added, leaving
> just the return of error on failure.
> 
> What do you think?

Yes, I would like to see your patch, and we'll apply it if it looks
correct.

> > Your fix to sensord is totally welcome. I could never find the time to
> > work on ticket #2330, so if you have a working patch I will be very
> > happy to review and apply it.
> 
> It looks like I can't add to the ticket.  What's the best way to submit a
> patch?  Just send it to this same list?  Is a patch against 2.10.7 OK?

Yes, a patch against 2.10.7 is OK. Please send it to the lm-sensors
mailing list, either inline or as an attachment. I'll take care of the
patch and the ticket. Alternatively, if you intend to contribute more
to the lm-sensors project in the future, I can create a wiki account
for you and you can attach your patch to the bug yourself. Tell me what
you prefer.

Thanks,
-- 
Jean Delvare

_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [lm-sensors] sensord exits on any error
  2008-12-05 18:34 [lm-sensors] sensord exits on any error Andy Poling
                   ` (2 preceding siblings ...)
  2008-12-06  8:48 ` Jean Delvare
@ 2008-12-12  8:01 ` Jean Delvare
  2008-12-12 21:45 ` Andy Poling
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Jean Delvare @ 2008-12-12  8:01 UTC (permalink / raw)
  To: lm-sensors

Hi Andy,

On Sun, 7 Dec 2008 21:41:13 -0600 (CST), Andy Poling wrote:
> I'll put together updated patches for sensord and the w83793 driver and send
> them to the list.

Any news from the sensord patch? I am waiting for it before I release
lm-sensors 2.10.8, and I'd really like to release it before I leave for
vacation.

Thanks,
-- 
Jean Delvare

_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [lm-sensors] sensord exits on any error
  2008-12-05 18:34 [lm-sensors] sensord exits on any error Andy Poling
                   ` (3 preceding siblings ...)
  2008-12-12  8:01 ` Jean Delvare
@ 2008-12-12 21:45 ` Andy Poling
  2008-12-14 15:44 ` Jean Delvare
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Andy Poling @ 2008-12-12 21:45 UTC (permalink / raw)
  To: lm-sensors

[-- Attachment #1: Type: TEXT/PLAIN, Size: 491 bytes --]

On Fri, 12 Dec 2008, Jean Delvare wrote:
>> I'll put together updated patches for sensord and the w83793 driver and send
>> them to the list.
>
> Any news from the sensord patch? I am waiting for it before I release
> lm-sensors 2.10.8, and I'd really like to release it before I leave for
> vacation.

My apologies - for some reason I was holding this until we also had the
updated patch for w83793.c ready.

It is kind of embarrassingly simple.  I hope a unified diff is acceptable.

-Andy

[-- Attachment #2: sensord.c.pat --]
[-- Type: IMAGE/x-coreldrawpattern, Size: 1557 bytes --]

[-- Attachment #3: Type: text/plain, Size: 153 bytes --]

_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [lm-sensors] sensord exits on any error
  2008-12-05 18:34 [lm-sensors] sensord exits on any error Andy Poling
                   ` (4 preceding siblings ...)
  2008-12-12 21:45 ` Andy Poling
@ 2008-12-14 15:44 ` Jean Delvare
  2008-12-16  9:05 ` Andy Poling
  2008-12-16  9:29 ` Jean Delvare
  7 siblings, 0 replies; 9+ messages in thread
From: Jean Delvare @ 2008-12-14 15:44 UTC (permalink / raw)
  To: lm-sensors

[-- Attachment #1: Type: text/plain, Size: 1058 bytes --]

Hi Andy,

On Fri, 12 Dec 2008 15:45:35 -0600 (CST), Andy Poling wrote:
> It is kind of embarrassingly simple.  I hope a unified diff is acceptable.

A unified diff is perfect. However I do not think the fix is as simple
as you suggested. The original code has a rather fragile way to handle
sleep times between actions, and now that failures no longer break the
loop, odd things can happen. In particular, with your patch, I hit a
case where the system log would get filled at a very high rate on
permanent errors, presumably because sleep() was called with negative
values.

Please see my attached patch which hopefully fixes all the issues. It
worked fine for me. Main differences with your original patch:
* Errors on reloadLib() are logged.
* Errors are logged using sensorLog() instead of syslog().
* Error messages use %d instead of %m. %m read errno but the sensord
  code doesn't set errno.
* Each of the 3 actions are handled separately, even if one fails, the
  other ones are attempted.

Please give it a try and report.

Thanks,
-- 
Jean Delvare

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: sensord-survive-transient-errors.patch --]
[-- Type: text/x-patch; name=sensord-survive-transient-errors.patch, Size: 1967 bytes --]

Index: prog/sensord/sensord.c
===================================================================
--- prog/sensord/sensord.c	(révision 5564)
+++ prog/sensord/sensord.c	(copie de travail)
@@ -85,27 +85,30 @@
 
   sensorLog (LOG_INFO, "sensord started");
 
-  while (!done && (ret == 0)) {
-    if (ret == 0)
-      ret = reloadLib ();
-    if ((ret == 0) && scanTime) { /* should I scan on the read cycle? */
-      ret = scanChips ();
-      if (scanValue <= 0)
-        scanValue += scanTime;
+  while (!done) {
+    ret = reloadLib ();
+    if (ret)
+      sensorLog (LOG_NOTICE, "config reload error (%d)", ret);
+    if (scanTime && (scanValue <= 0)) {
+      if ((ret = scanChips ()))
+        sensorLog (LOG_NOTICE, "sensor scan error (%d)", ret);
+      scanValue += scanTime;
     }
-    if ((ret == 0) && logTime && (logValue <= 0)) {
-      ret = readChips ();
+    if (logTime && (logValue <= 0)) {
+      if ((ret = readChips ()))
+        sensorLog (LOG_NOTICE, "sensor read error (%d)", ret);
       logValue += logTime;
     }
-    if ((ret == 0) && rrdTime && rrdFile && (rrdValue <= 0)) {
-      ret = rrdUpdate ();
+    if (rrdTime && rrdFile && (rrdValue <= 0)) {
+      if ((ret = rrdUpdate ()))
+        sensorLog (LOG_NOTICE, "rrd update error (%d)", ret);
       /*
        * The amount of time to wait is computed using the same method as
        * in RRD instead of simply adding the interval.
        */
       rrdValue = rrdTime - time(NULL) % rrdTime;
     }
-    if (!done && (ret == 0)) {
+    if (!done) {
       int a = logTime ? logValue : INT_MAX;
       int b = scanTime ? scanValue : INT_MAX;
       int c = (rrdTime && rrdFile) ? rrdValue : INT_MAX;
@@ -117,10 +120,7 @@
     }
   }
 
-  if (ret)
-    sensorLog (LOG_INFO, "sensord failed (%d)", ret);
-  else
-    sensorLog (LOG_INFO, "sensord stopped");
+  sensorLog (LOG_INFO, "sensord stopped");
 
   return ret;
 }

[-- Attachment #3: Type: text/plain, Size: 153 bytes --]

_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [lm-sensors] sensord exits on any error
  2008-12-05 18:34 [lm-sensors] sensord exits on any error Andy Poling
                   ` (5 preceding siblings ...)
  2008-12-14 15:44 ` Jean Delvare
@ 2008-12-16  9:05 ` Andy Poling
  2008-12-16  9:29 ` Jean Delvare
  7 siblings, 0 replies; 9+ messages in thread
From: Andy Poling @ 2008-12-16  9:05 UTC (permalink / raw)
  To: lm-sensors

On Sun, 14 Dec 2008, Jean Delvare wrote:
> Please give it a try and report.

It looks good - I should be able to try it out tomorrow.

-Andy




_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [lm-sensors] sensord exits on any error
  2008-12-05 18:34 [lm-sensors] sensord exits on any error Andy Poling
                   ` (6 preceding siblings ...)
  2008-12-16  9:05 ` Andy Poling
@ 2008-12-16  9:29 ` Jean Delvare
  7 siblings, 0 replies; 9+ messages in thread
From: Jean Delvare @ 2008-12-16  9:29 UTC (permalink / raw)
  To: lm-sensors

On Tue, 16 Dec 2008 03:05:36 -0600 (CST), Andy Poling wrote:
> On Sun, 14 Dec 2008, Jean Delvare wrote:
> > Please give it a try and report.
> 
> It looks good - I should be able to try it out tomorrow.

Great, thanks. I have to release lm-sensors 2.10.8 today, so I can't
wait for your feedback, but please still report if you find any issue
with the patch. At least it worked fine for me so far so I have good
hope that it'll work for you as well.

-- 
Jean Delvare

_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2008-12-16  9:29 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-12-05 18:34 [lm-sensors] sensord exits on any error Andy Poling
2008-12-05 20:27 ` Jean Delvare
2008-12-06  0:34 ` Andy Poling
2008-12-06  8:48 ` Jean Delvare
2008-12-12  8:01 ` Jean Delvare
2008-12-12 21:45 ` Andy Poling
2008-12-14 15:44 ` Jean Delvare
2008-12-16  9:05 ` Andy Poling
2008-12-16  9:29 ` Jean Delvare

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.