From: Jesper Dangaard Brouer <hawk@comx.dk>
To: Ben Hutchings <bhutchings@solarflare.com>
Cc: "netdev@vger.kernel.org" <netdev@vger.kernel.org>
Subject: Re: Driver SFC: Possible bug in LM87 temperature XFP detection code
Date: Tue, 28 Apr 2009 14:44:04 +0000
Message-ID: <1240929844.10689.35.camel@localhost.localdomain>
In-Reply-To: <1240925799.3200.16.camel@achroite>

On Tue, 2009-04-28 at 14:36 +0100, Ben Hutchings wrote:
> On Tue, 2009-04-28 at 11:36 +0200, Jesper Dangaard Brouer wrote:
> > Hi Ben,
> > 
> > I have borrowed some SMC10GPCIe-XFP NICs directly from SMC for
> > evaluation.  The NICs use a Solarflare chip and the SFC driver.
> > 
> > When I unplug the fiber cable, I start getting these errors:
> > 
> > +--------
> >  sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 30:00) INTERNAL EXTERNAL
> >  sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
> > 
> >  sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 30:00) INTERNAL EXTERNAL
> >  sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
> > 
> >  sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 10:00) INTERNAL
> >  sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
> > +---------
> > 
> > Reading through the driver code (drivers/net/sfc/boards.c), this problem
> > is related to temperature.
> 
> Right.  And the sensors are not polled while the link is up, on the
> assumption that a temperature or voltage fault will cause the link to go
> down, and because bit-banged I2C will reduce throughput slightly.

In my situation the link does not go down due to the temperature issue.
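
For reference, the status bytes in the log above can be read as alarm
bits.  A minimal sketch, assuming (from the log output alone, not from
checking boards.c) that 0x10 in the first status byte is the internal
temperature alarm and 0x20 the external one.  The macro and function
names are made up for illustration:

```c
#include <stddef.h>
#include <stdio.h>

/* Assumed alarm bits in the first status byte, inferred from the log. */
#define ALARM_TEMP_INT 0x10 /* logged as INTERNAL */
#define ALARM_TEMP_EXT 0x20 /* logged as EXTERNAL */

/* Write the names of the temperature alarms set in status1 into buf. */
static const char *decode_temp_alarms(unsigned char status1,
				      char *buf, size_t len)
{
	snprintf(buf, len, "%s%s",
		 (status1 & ALARM_TEMP_INT) ? " INTERNAL" : "",
		 (status1 & ALARM_TEMP_EXT) ? " EXTERNAL" : "");
	return buf;
}
```

Under that assumption, "status 30:00" has both temperature alarms set
and "status 10:00" only the internal one, which matches the suffixes in
the log lines.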


> > The real issue is that I cannot get the device up and running again
> > after lowering the temperature.  Only if I unload and load the sfc
> > driver, then I can get the device running again.
> > 
> > I'm thinking that perhaps a PHY power-up is missing after the
> > temperature alarm has gone?
> 
> We considered it most important to shut down the board to prevent or
> mitigate damage, and did not implement any recovery beyond that.

In my case, putting the PHY in PHY_MODE_LOW_POWER does not help lower
the temperature.  The errors are continuous until I apply "manual"
airflow ;-)


> > I'm using kernel 2.6.30-rc1-net-next-00664-gd93fe1a.
> > 
> > 
> > To Ben: do you have anything you want me to try?  Do you want to fix this
> > yourself, or can you give me some code hints or patches to try out?
> 
> I don't intend to fix this myself.  If you want to try implementing this
> then you should start by looking at efx_monitor() in efx.c.  However, I
> think your time might be better spent in fixing the air flow in the
> computer before the board is permanently damaged.

I see your point, I don't want to damage the board... not sure I want to
fix it then... Although in a production environment, I think the driver
should support exchanging a failed XFP without rebooting the server.
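
To make the idea concrete, here is a rough sketch of the kind of
recovery I have in mind.  It is not the sfc driver's code: the types,
names and state handling below are all hypothetical, and a real change
would have to live in or near efx_monitor().  It only polls while the
link is down (as Ben describes), shuts the PHY down on a fault, and
powers it up again once the fault has cleared:

```c
#include <stdbool.h>

/* Hypothetical PHY power states; not the sfc driver's real types. */
enum phy_state { PHY_ON, PHY_SHUT_DOWN };

struct board {
	enum phy_state phy;
	bool link_up;
};

/*
 * One monitor pass.  sensor_fault is the result of reading the sensor.
 * The sensor result is only acted on while the link is down.
 * Returns true if the PHY state changed.
 */
static bool monitor_pass(struct board *b, bool sensor_fault)
{
	if (b->link_up)
		return false;	/* link up: skip the slow sensor poll */

	if (sensor_fault && b->phy == PHY_ON) {
		b->phy = PHY_SHUT_DOWN;	/* protect the board */
		return true;
	}
	if (!sensor_fault && b->phy == PHY_SHUT_DOWN) {
		b->phy = PHY_ON;	/* alarm cleared: power up again */
		return true;
	}
	return false;
}
```

Whether powering up automatically is safe for the hardware is a
separate question that this sketch does not answer.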

I also think that we should make the error message more explicit, in
order to warn people before the board is permanently damaged.  I'll
post a patch proposal as a reply to this message...

-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network developer
  Cand. Scient Datalog / MSc.
  Author of http://adsl-optimizer.dk
  LinkedIn: http://www.linkedin.com/in/brouer


