* dying hdd causing MCE and panic (libata)
@ 2008-04-20  9:33 Rumi Szabolcs
  2008-04-20  9:33 ` Alan Cox
  0 siblings, 1 reply; 4+ messages in thread
From: Rumi Szabolcs @ 2008-04-20  9:33 UTC
  To: linux-kernel

Hello all!

A SATA drive in one of my servers has taken its final steps towards
the grave. It printed some obvious signs of this on the console
(ATA transactions failing), but then it also triggered an MCE
(CPU context corrupt) and the kernel panicked.
This server is otherwise rock stable, with uptimes measured in
months between planned restarts.

The machine was completely disconnected from power and restarted
multiple times, but it always crashed during boot with an MCE,
a panic, or both.

Sorry, but I cannot provide exact debug information right now because
I wasn't physically there at the time and I'm still 250 km away from
that server. In fact, I remotely guided two people without a clue
over the phone; they read things from the console for me,
restarted the machine, etc.

So in the end I told them to open up the server and pull the SATA
cable from that particular drive. All the MCEs and panics went away
immediately, and the machine has been running fine since then.

Hardware:

- nForce4-based motherboard (chipset-integrated SATA ports)
- Athlon 64 single-core CPU
- DiamondMax 9 SATA hard drive

Kernel:

2.6.23-gentoo-r3 (no preempt, no smp)

My questions:

- Is it normal that a simple hard disk failure (that is not even
the system disk) causes MCEs and kernel panics?

- Is this a problem that is induced completely on the hardware
level (eg. the southbridge going crazy and making the whole
hardware platform unstable) or a problem that could be fixed
or handled properly on the software (kernel) level?

Thanks!

Best regards,
Sab


* Re: dying hdd causing MCE and panic (libata)
  2008-04-20  9:33 Rumi Szabolcs
@ 2008-04-20  9:33 ` Alan Cox
  0 siblings, 0 replies; 4+ messages in thread
From: Alan Cox @ 2008-04-20  9:33 UTC
  To: Rumi Szabolcs; +Cc: linux-kernel

> - Is it normal that a simple hard disk failure (that is not even
> the system disk) causes MCEs and kernel panics?

No, but with old-style (pre-AHCI) IDE it can, because the system may
raise an MCE if a CPU<->disk transaction hangs. If a drive is causing
PSU problems it could also occur, I guess; ditto heat. You'd need to
decode the MCE.

> - Is this a problem that is induced completely on the hardware
> level (eg. the southbridge going crazy and making the whole
> hardware platform unstable) or a problem that could be fixed
> or handled properly on the software (kernel) level?

An MCE the hardware can recover from is reported and we continue. CPU
Context Corrupt means the processor internally set a "cannot carry on"
indicator.

Alan
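
For readers who want to see what "decoding the MCE" and "CPU Context
Corrupt" refer to, here is a minimal sketch, not taken from this thread.
It assumes the standard x86 layout of the upper flag bits in an
MCi_STATUS register, where bit 57 (PCC) is the "cannot carry on"
indicator. The function names and the sample status value are
hypothetical and used only for illustration; a real decode would also
interpret the error code in the low bits, which this sketch ignores.

```python
# Flag bits in the upper part of an x86 MCi_STATUS register.
MCI_STATUS_BITS = {
    "VAL": 63,    # register holds a valid error record
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MCi_MISC holds additional information
    "ADDRV": 58,  # MCi_ADDR holds a related address
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status):
    """Return a dict mapping each flag name to True/False."""
    return {name: bool((status >> bit) & 1)
            for name, bit in MCI_STATUS_BITS.items()}


def is_fatal(status):
    """A valid record with PCC set means execution cannot safely resume."""
    flags = decode_mci_status(status)
    return flags["VAL"] and flags["PCC"]


if __name__ == "__main__":
    # Hypothetical value for illustration only (not from the reporter's
    # machine): VAL, UC, EN and PCC set, error code left at zero.
    sample = 0xB200000000000000
    set_flags = [n for n, on in decode_mci_status(sample).items() if on]
    print("flags set:", ", ".join(set_flags))
    print("fatal:", is_fatal(sample))
```

A record with VAL set but PCC clear is the "reported and we continue"
case described above; one with PCC set is the case that ends in a panic.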


* Re: dying hdd causing MCE and panic (libata)
       [not found] <fa.wEQ+jetKkQxiDyPBYbk6js8pT+0@ifi.uio.no>
@ 2008-04-20 18:21 ` Robert Hancock
  2008-04-22  9:33   ` Peer Chen
  0 siblings, 1 reply; 4+ messages in thread
From: Robert Hancock @ 2008-04-20 18:21 UTC
  To: Rumi Szabolcs; +Cc: linux-kernel, Peer Chen, Kuan Luo, Allen Martin

Rumi Szabolcs wrote:
> - Is it normal that a simple hard disk failure (that is not even
> the system disk) causes MCEs and kernel panics?
> 
> - Is this a problem that is induced completely on the hardware
> level (eg. the southbridge going crazy and making the whole
> hardware platform unstable) or a problem that could be fixed
> or handled properly on the software (kernel) level?

It's known that nForce4 ADMA can in certain cases hang on error handling
and cause an MCE when we attempt to switch the controller into register
mode and read the ATA registers in order to diagnose the problem. The
MCE indicates the CPU timed out waiting for a register read from the
chipset on the HyperTransport bus. It's not presently known why this
happens, or what we could do differently to avoid it. We're presently
hampered by the lack of public information from NVIDIA on this
controller, so the ball is kind of in NVIDIA's court.

In the latest kernels sata_nv ADMA support is being disabled by default, 
which may prevent this from happening. However, some odd hotplug/error 
handling behavior was seen on these controllers before ADMA support was 
implemented, so it may not entirely fix the problem.
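
To check whether ADMA is actually in use on a given machine, a small
sketch like the following may help. It assumes the running sata_nv
driver exposes its `adma` module parameter read-only under
/sys/module/sata_nv/parameters/, and that boolean parameters read back
as "Y" or "N"; whether the parameter exists and what it defaults to
depends on the kernel version. The helper names and the `sysfs_root`
argument are hypothetical, introduced only so the code can be tried
against a test directory.

```python
from pathlib import Path
from typing import Optional


def read_module_param(module: str, param: str,
                      sysfs_root: str = "/sys/module") -> Optional[str]:
    """Return the parameter's text value, or None if it cannot be read."""
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        # Module not loaded, parameter not exported, or no permission.
        return None


def adma_enabled(sysfs_root: str = "/sys/module") -> Optional[bool]:
    """True/False if sata_nv reports its adma setting, None if unknown."""
    value = read_module_param("sata_nv", "adma", sysfs_root)
    if value is None:
        return None
    return value in ("Y", "y", "1")


if __name__ == "__main__":
    state = adma_enabled()
    if state is None:
        print("sata_nv adma setting not available on this system")
    else:
        print("sata_nv ADMA enabled:", state)
```

On kernels where ADMA is still on by default, it can usually be turned
off by loading the module with `adma=0` (or `sata_nv.adma=0` on the
kernel command line when the driver is built in); verify this against
the documentation for the kernel in use.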


* RE: dying hdd causing MCE and panic (libata)
  2008-04-20 18:21 ` dying hdd causing MCE and panic (libata) Robert Hancock
@ 2008-04-22  9:33   ` Peer Chen
  0 siblings, 0 replies; 4+ messages in thread
From: Peer Chen @ 2008-04-22  9:33 UTC
  To: Robert Hancock, Rumi Szabolcs; +Cc: linux-kernel, Kuan Luo, Allen Martin

This is an unresolved machine check error. The only way to avoid it
is for the driver not to switch to register mode and read the task
file.

BRs
Peer Chen
 


