From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jesper Dangaard Brouer
Subject: Re: Driver SFC: Possible bug in LM87 temperature XFP detection code
Date: Tue, 28 Apr 2009 14:44:04 +0000
Message-ID: <1240929844.10689.35.camel@localhost.localdomain>
References: <1240911369.10689.20.camel@localhost.localdomain> <1240925799.3200.16.camel@achroite>
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Cc: "netdev@vger.kernel.org"
To: Ben Hutchings
Return-path:
Received: from lanfw001a.cxnet.dk ([87.72.215.196]:56827 "EHLO lanfw001a.cxnet.dk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753505AbZD1OoG (ORCPT ); Tue, 28 Apr 2009 10:44:06 -0400
In-Reply-To: <1240925799.3200.16.camel@achroite>
Sender: netdev-owner@vger.kernel.org
List-ID:

On Tue, 2009-04-28 at 14:36 +0100, Ben Hutchings wrote:
> On Tue, 2009-04-28 at 11:36 +0200, Jesper Dangaard Brouer wrote:
> > Hi Ben,
> >
> > I have borrowed some SMC10GPCIe-XFP NICs directly from SMC for
> > evaluation. The NICs uses a Solarflare Chip and the SFC driver.
> >
> > If unpluging the fiber cable I start getting these errors:
> >
> > +--------
> > sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 30:00) INTERNAL EXTERNAL
> > sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
> >
> > sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 30:00) INTERNAL EXTERNAL
> > sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
> >
> > sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 10:00) INTERNAL
> > sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
> > +---------
> >
> > Reading through the driver code (drivers/net/sfc/boards.c), this problem
> > is related to temperature.
>
> Right.
> And the sensors are not polled while the link is up, on the
> assumption that a temperature or voltage fault will cause the link to go
> down, and because bit-banged I2C will reduce throughput slightly.

In my situation the link does not go down due to the temperature issue.

> > The real issues is that I cannot get the device up and running again
> > after lowering the temperature. Only if I unload and load the sfc
> > driver, then I can get the device running again.
> >
> > I'm thinking perhaps there is missing a PHY power up again, after the
> > temperature alarm has gone?
>
> We considered it most important to shut down the board to prevent or
> mitigate damage, and did not implement any recovery beyond that.

In my case, putting the PHY in PHY_MODE_LOW_POWER does not help lower
the temperature. The errors are continuous until I apply "manual"
airflow ;-)

> > I'm using kernel 2.6.30-rc1-net-next-00664-gd93fe1a.
> >
> >
> > To Ben; do you have anything you want me to try. Do you want to fix this
> > you self, or can you give me some code hints or patches to try out?
>
> I don't intend to fix this myself. If you want to try implementing this
> then you should start by looking at efx_monitor() in efx.c. However, I
> think your time might be better spent in fixing the air flow in the
> computer before the board is permanently damaged.

I see your point. I don't want to damage the board, so I'm not sure I
want to fix it then...

In a production environment, though, I think the driver should support
exchanging a failed XFP without rebooting the server.

I also think we should make the error message a bit more explicit, in
order to warn people before the board is permanently damaged.

I'll post a patch proposal as a reply to this message...

-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network developer
  Cand. Scient Datalog / MSc.
  Author of http://adsl-optimizer.dk
  LinkedIn: http://www.linkedin.com/in/brouer