From: Jesper Dangaard Brouer <hawk@comx.dk>
To: Ben Hutchings <bhutchings@solarflare.com>
Cc: "netdev@vger.kernel.org" <netdev@vger.kernel.org>
Subject: Re: Driver SFC: Possible bug in LM87 temperature XFP detection code
Date: Tue, 28 Apr 2009 14:44:04 +0000
Message-ID: <1240929844.10689.35.camel@localhost.localdomain>
In-Reply-To: <1240925799.3200.16.camel@achroite>
On Tue, 2009-04-28 at 14:36 +0100, Ben Hutchings wrote:
> On Tue, 2009-04-28 at 11:36 +0200, Jesper Dangaard Brouer wrote:
> > Hi Ben,
> >
> > I have borrowed some SMC10GPCIe-XFP NICs directly from SMC for
> > evaluation. The NICs use a Solarflare chip and the sfc driver.
> >
> > If I unplug the fiber cable, I start getting these errors:
> >
> > +--------
> > sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 30:00) INTERNAL EXTERNAL
> > sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
> >
> > sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 30:00) INTERNAL EXTERNAL
> > sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
> >
> > sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 10:00) INTERNAL
> > sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
> > +---------
> >
> > Reading through the driver code (drivers/net/sfc/boards.c), this problem
> > is related to temperature.
>
> Right. And the sensors are not polled while the link is up, on the
> assumption that a temperature or voltage fault will cause the link to go
> down, and because bit-banged I2C will reduce throughput slightly.
In my situation the link does not go down due to the temperature issue.
> > The real issue is that I cannot get the device up and running again
> > after lowering the temperature. Only if I unload and load the sfc
> > driver, then I can get the device running again.
> >
> > I'm thinking perhaps a PHY power-up is missing after the
> > temperature alarm has gone?
>
> We considered it most important to shut down the board to prevent or
> mitigate damage, and did not implement any recovery beyond that.
In my case, putting the PHY in PHY_MODE_LOW_POWER does not help lower
the temperature. The errors are continuous until I apply "manual"
airflow ;-)
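A minimal sketch of the recovery step that seems to be missing, written as
a small standalone state machine rather than as driver code. Every name
here (board_monitor, monitor_poll, CLEAR_POLLS_BEFORE_RESTART) is
hypothetical and not taken from the sfc driver, and the idea of waiting
for a few alarm-free polls before powering the PHY up again is an
assumption, not something the driver does today.

```c
#include <stdbool.h>

/* Hypothetical PHY states; not the sfc driver's own types. */
enum phy_state { PHY_RUNNING, PHY_SHUT_DOWN };

struct board_monitor {
	enum phy_state state;
	unsigned int clear_polls;	/* consecutive polls without an alarm */
};

/* Assumed hysteresis: require this many clean polls before restarting. */
#define CLEAR_POLLS_BEFORE_RESTART 3

/*
 * Called once per sensor poll. An alarm always shuts the PHY down.
 * While shut down, the PHY is only brought back after the alarm has
 * stayed clear for CLEAR_POLLS_BEFORE_RESTART consecutive polls.
 */
enum phy_state monitor_poll(struct board_monitor *m, bool alarm)
{
	if (alarm) {
		m->clear_polls = 0;
		m->state = PHY_SHUT_DOWN;
	} else if (m->state == PHY_SHUT_DOWN) {
		if (++m->clear_polls >= CLEAR_POLLS_BEFORE_RESTART) {
			m->clear_polls = 0;
			m->state = PHY_RUNNING;
		}
	}
	return m->state;
}
```

The delay before restarting is there so that a sensor hovering around its
limit does not make the PHY flap between the two states on every poll.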
> > I'm using kernel 2.6.30-rc1-net-next-00664-gd93fe1a.
> >
> >
> > To Ben: do you have anything you want me to try? Do you want to fix this
> > yourself, or can you give me some code hints or patches to try out?
>
> I don't intend to fix this myself. If you want to try implementing this
> then you should start by looking at efx_monitor() in efx.c. However, I
> think your time might be better spent in fixing the air flow in the
> computer before the board is permanently damaged.
I see your point. I don't want to damage the board, so I'm not sure I want
to fix it then. In a production environment, though, I think the driver
should support exchanging a failed XFP without rebooting the server.
I also think that we should make the error message a bit more
explicit, in order to warn people before the board is permanently
damaged. I'll post a patch proposal as a reply to this message...
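
A minimal sketch of what a more explicit message could look like; this is
not the actual patch. It assumes the alarm bit values implied by the log
quoted above (status 10 was printed as INTERNAL, status 30 as INTERNAL
EXTERNAL). The macro and function names are hypothetical and are not
taken from the sfc driver.

```c
#include <stdint.h>
#include <stdio.h>

/* Bit values inferred from the quoted log; names are hypothetical. */
#define ALARM_TEMP_INTERNAL 0x10
#define ALARM_TEMP_EXTERNAL 0x20

/*
 * Write a human-readable description of the temperature alarm bits in
 * the first status byte into buf (at most len bytes, NUL-terminated).
 * Returns the number of temperature alarms that are set (0, 1 or 2).
 */
int describe_temp_alarms(uint8_t status, char *buf, size_t len)
{
	int internal = (status & ALARM_TEMP_INTERNAL) != 0;
	int external = (status & ALARM_TEMP_EXTERNAL) != 0;

	if (!internal && !external)
		snprintf(buf, len, "no temperature alarm");
	else
		snprintf(buf, len,
			 "temperature alarm on %s%s%s sensor(s); check airflow",
			 internal ? "internal" : "",
			 (internal && external) ? " and " : "",
			 external ? "external" : "");
	return internal + external;
}
```

Calling describe_temp_alarms(0x30, buf, sizeof(buf)) with the first status
byte from the log would report both sensors in a single message, which
tells the reader what to do instead of only naming the flags.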
--
Med venlig hilsen / Best regards
Jesper Brouer
ComX Networks A/S
Linux Network developer
Cand. Scient Datalog / MSc.
Author of http://adsl-optimizer.dk
LinkedIn: http://www.linkedin.com/in/brouer