linux-raid.vger.kernel.org archive mirror
* Machine hanging on synchronize cache on shutdown 2.6.22-rc4-git[45678]
@ 2007-06-16 11:52 Brad Campbell
  0 siblings, 0 replies; 6+ messages in thread
From: Brad Campbell @ 2007-06-16 11:52 UTC
  To: linux-ide, RAID Linux

G'day all,

I've got a box here based on current Debian Stable.
It's got 15 Maxtor SATA drives in it on 4 Promise TX4 controllers.

Using kernel 2.6.21.x it shuts down, but of course with a huge "clack" as 15 drives all do emergency 
head parks simultaneously. I thought I'd upgrade to 2.6.22-rc to get around this but the machine 
just hangs up hard apparently trying to sync cache on a drive.

I've run this process manually, so I know it is being performed properly.

Prior to shutdown, all nfsd processes are stopped, filesystems unmounted and md arrays stopped.
/proc/mdstat shows
root@storage1:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
unused devices: <none>
root@storage1:~#
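
The check above can also be scripted. A minimal Python sketch, assuming that array entries in /proc/mdstat start with a device name such as md0 followed by a colon; the function name and the second sample text are made up for illustration:

```python
import re

def active_md_arrays(mdstat_text):
    """Return the names of md arrays listed in /proc/mdstat-style text."""
    # Array entries look like "md0 : active raid6 ..."; the header lines
    # ("Personalities", "unused devices") do not start with "md".
    return re.findall(r"^(md\S*)\s*:", mdstat_text, flags=re.MULTILINE)

# Output quoted in this message: no arrays remain assembled.
stopped = (
    "Personalities : [raid6] [raid5] [raid4]\n"
    "unused devices: <none>\n"
)

# Hypothetical output with one array still running, for comparison.
running = (
    "Personalities : [raid6] [raid5] [raid4]\n"
    "md0 : active raid6 sdb[0] sdc[1] sdd[2] sde[3]\n"
    "unused devices: <none>\n"
)

if __name__ == "__main__":
    print(active_md_arrays(stopped))
    print(active_md_arrays(running))
```

An empty list means every array was stopped before the shutdown was attempted.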

Here is the final hangup.

http://www.fnarfbargle.com/CIMG1029.JPG

The keyboard is responsive up until the "hard resetting port" message, at which point the machine
locks solid and requires a power-cycle.

It's not really something I want to bisect if I can avoid it, given the lovely noise it makes when I
power it off. (This box only has a hard power switch, no reset or soft power buttons.)

Brad
-- 
"Human beings, who are almost unique in having the ability
to learn from the experience of others, are also remarkable
for their apparent disinclination to do so." -- Douglas Adams


* Re: Machine hanging on synchronize cache on shutdown 2.6.22-rc4-git[45678]
@ 2007-06-16 19:09 Mikael Pettersson
  2007-06-18  7:09 ` Tejun Heo
  0 siblings, 1 reply; 6+ messages in thread
From: Mikael Pettersson @ 2007-06-16 19:09 UTC
  To: brad, linux-ide, linux-raid

On Sat, 16 Jun 2007 15:52:33 +0400, Brad Campbell wrote:
> I've got a box here based on current Debian Stable.
> It's got 15 Maxtor SATA drives in it on 4 Promise TX4 controllers.
> 
> Using kernel 2.6.21.x it shuts down, but of course with a huge "clack" as 15 drives all do emergency 
> head parks simultaneously. I thought I'd upgrade to 2.6.22-rc to get around this but the machine 
> just hangs up hard apparently trying to sync cache on a drive.
> 
> I've run this process manually, so I know it is being performed properly.
> 
> Prior to shutdown, all nfsd processes are stopped, filesystems unmounted and md arrays stopped.
> /proc/mdstat shows
> root@storage1:~# cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> unused devices: <none>
> root@storage1:~#
> 
> Here is the final hangup.
> 
> http://www.fnarfbargle.com/CIMG1029.JPG

Something sent a command to the disk on ata15 after the PHY had been
offlined and the interface had been put in SLUMBER state (SStatus 614).
Consequently the command timed out. Libata tried a soft reset, and then
a hard reset, after which the machine hung.
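
For reference, the SStatus value can be decoded field by field. A minimal Python sketch, assuming the value is printed in hexadecimal (0x614) and the SATA SStatus layout of DET in bits 3:0, SPD in bits 7:4 and IPM in bits 11:8; the function and table names are illustrative only:

```python
# Meanings of the three 4-bit SStatus fields; unlisted codes are reserved.
DET = {
    0x0: "no device detected, PHY communication not established",
    0x1: "device detected, PHY communication not established",
    0x3: "device detected, PHY communication established",
    0x4: "PHY offline (interface disabled or in BIST loopback)",
}
SPD = {
    0x0: "no negotiated speed",
    0x1: "Gen1 (1.5 Gbps)",
    0x2: "Gen2 (3.0 Gbps)",
    0x3: "Gen3 (6.0 Gbps)",
}
IPM = {
    0x0: "device not present or communication not established",
    0x1: "active",
    0x2: "PARTIAL power management state",
    0x6: "SLUMBER power management state",
}

def decode_sstatus(value):
    """Split an SStatus value into its DET, SPD and IPM descriptions."""
    det = value & 0xF
    spd = (value >> 4) & 0xF
    ipm = (value >> 8) & 0xF
    return {
        "det": DET.get(det, "reserved (0x%x)" % det),
        "spd": SPD.get(spd, "reserved (0x%x)" % spd),
        "ipm": IPM.get(ipm, "reserved (0x%x)" % ipm),
    }

if __name__ == "__main__":
    for name, meaning in decode_sstatus(0x614).items():
        print(name, "=", meaning)
```

Under those assumptions 0x614 splits into DET 4, SPD 1 and IPM 6, which matches the offlined PHY and SLUMBER state described above.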

I don't think sata_promise is the guilty party here. Looks like some
layer above sata_promise got confused about the state of the interface.

I did a quick sata_promise test here with kernel 2.6.22-rc4-git8 and FC4
userspace, and there was no problem shutting the machine down.

/Mikael


* Re: Machine hanging on synchronize cache on shutdown 2.6.22-rc4-git[45678]
  2007-06-16 19:09 Mikael Pettersson
@ 2007-06-18  7:09 ` Tejun Heo
  0 siblings, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2007-06-18  7:09 UTC
  To: Mikael Pettersson; +Cc: brad, linux-ide, linux-raid

Hello,

Mikael Pettersson wrote:
> On Sat, 16 Jun 2007 15:52:33 +0400, Brad Campbell wrote:
>> I've got a box here based on current Debian Stable.
>> It's got 15 Maxtor SATA drives in it on 4 Promise TX4 controllers.
>>
>> Using kernel 2.6.21.x it shuts down, but of course with a huge "clack" as 15 drives all do emergency 
>> head parks simultaneously. I thought I'd upgrade to 2.6.22-rc to get around this but the machine 
>> just hangs up hard apparently trying to sync cache on a drive.
>>
>> I've run this process manually, so I know it is being performed properly.
>>
>> Prior to shutdown, all nfsd processes are stopped, filesystems unmounted and md arrays stopped.
>> /proc/mdstat shows
>> root@storage1:~# cat /proc/mdstat
>> Personalities : [raid6] [raid5] [raid4]
>> unused devices: <none>
>> root@storage1:~#
>>
>> Here is the final hangup.
>>
>> http://www.fnarfbargle.com/CIMG1029.JPG
> 
> Something sent a command to the disk on ata15 after the PHY had been
> offlined and the interface had been put in SLUMBER state (SStatus 614).
> Consequently the command timed out. Libata tried a soft reset, and then
> a hard reset, after which the machine hung.

Hmm... weird.  Maybe device initiated power saving (DIPS) is active?

> I don't think sata_promise is the guilty party here. Looks like some
> layer above sata_promise got confused about the state of the interface.

But locking up hard after hardreset is a problem of sata_promise, no?

Thanks.

-- 
tejun


* Re: Machine hanging on synchronize cache on shutdown 2.6.22-rc4-git[45678]
@ 2007-06-18 11:29 Mikael Pettersson
  2007-06-18 11:33 ` Tejun Heo
  2007-06-18 11:40 ` Brad Campbell
  0 siblings, 2 replies; 6+ messages in thread
From: Mikael Pettersson @ 2007-06-18 11:29 UTC
  To: htejun, mikpe; +Cc: brad, linux-ide, linux-raid

On Mon, 18 Jun 2007 16:09:49 +0900, Tejun Heo wrote:
> Mikael Pettersson wrote:
> > On Sat, 16 Jun 2007 15:52:33 +0400, Brad Campbell wrote:
> >> I've got a box here based on current Debian Stable.
> >> It's got 15 Maxtor SATA drives in it on 4 Promise TX4 controllers.
> >>
> >> Using kernel 2.6.21.x it shuts down, but of course with a huge "clack" as 15 drives all do emergency 
> >> head parks simultaneously. I thought I'd upgrade to 2.6.22-rc to get around this but the machine 
> >> just hangs up hard apparently trying to sync cache on a drive.
> >>
> >> I've run this process manually, so I know it is being performed properly.
> >>
> >> Prior to shutdown, all nfsd processes are stopped, filesystems unmounted and md arrays stopped.
> >> /proc/mdstat shows
> >> root@storage1:~# cat /proc/mdstat
> >> Personalities : [raid6] [raid5] [raid4]
> >> unused devices: <none>
> >> root@storage1:~#
> >>
> >> Here is the final hangup.
> >>
> >> http://www.fnarfbargle.com/CIMG1029.JPG
> > 
> > Something sent a command to the disk on ata15 after the PHY had been
> > offlined and the interface had been put in SLUMBER state (SStatus 614).
> > Consequently the command timed out. Libata tried a soft reset, and then
> > a hard reset, after which the machine hung.
> 
> Hmm... weird.  Maybe device initiated power saving (DIPS) is active?
> 
> > I don't think sata_promise is the guilty party here. Looks like some
> > layer above sata_promise got confused about the state of the interface.
> 
> But locking up hard after hardreset is a problem of sata_promise, no?

Maybe, maybe not. The original report doesn't specify where/how
the machine hung.

Brad: can you enable sysrq and check if the kernel responds to
sysrq when it appears to hang, and if so, where it's executing?
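
One way to confirm what sysrq permits before reproducing the hang is to read /proc/sys/kernel/sysrq. A rough Python sketch, assuming the value semantics documented for current kernels (0 disables sysrq, 1 enables every function, any larger value is a bitmask in which 0x8 permits the debugging dumps such as task and register listings); the helper names are hypothetical and the semantics on a 2.6.22-rc kernel should be checked against its own documentation:

```python
SYSRQ_PATH = "/proc/sys/kernel/sysrq"
DEBUG_DUMPS_BIT = 0x8  # assumed bit for "debugging dumps of processes etc."

def dumps_allowed(value):
    """Return True if this sysrq setting permits the debugging dumps."""
    if value == 1:
        # 1 is a special case meaning "all functions enabled".
        return True
    return value > 1 and bool(value & DEBUG_DUMPS_BIT)

def read_sysrq(path=SYSRQ_PATH):
    """Read the current sysrq setting as an integer."""
    with open(path) as f:
        return int(f.read().strip())

if __name__ == "__main__":
    value = read_sysrq()
    print("sysrq =", value, "dumps allowed:", dumps_allowed(value))
```

If the dumps are permitted but the keyboard no longer responds once the hard reset message appears, the sysrq key combinations will not help, and a serial console may be the only way to capture more output.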

sata_promise just passes sata_std_hardreset to ata_do_eh.
I've certainly seen EH hardresets work before, so I'm assuming
that something in this particular situation (PHY offlined,
kernel close to shutting down) breaks things.

FWIW, I'm seeing scsi layer accesses (cache flushes) after things
like rmmod sata_promise. They error out and don't seem to cause
any harm, but the fact that they occur at all makes me nervous.

/Mikael


* Re: Machine hanging on synchronize cache on shutdown 2.6.22-rc4-git[45678]
  2007-06-18 11:29 Mikael Pettersson
@ 2007-06-18 11:33 ` Tejun Heo
  2007-06-18 11:40 ` Brad Campbell
  1 sibling, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2007-06-18 11:33 UTC
  To: Mikael Pettersson; +Cc: brad, linux-ide, linux-raid

Mikael Pettersson wrote:
> FWIW, I'm seeing scsi layer accesses (cache flushes) after things
> like rmmod sata_promise. They error out and don't seem to cause
> any harm, but the fact that they occur at all makes me nervous.

That's okay.  On rmmod, as the low level device (ATA) goes away first
just as in hot unplug, sd gets notified *after* the device is gone but
sd still tries to clean up and issues the commands which are properly
rejected by the SCSI midlayer as the device is marked offline already,
so nothing to worry about there.

-- 
tejun


* Re: Machine hanging on synchronize cache on shutdown 2.6.22-rc4-git[45678]
  2007-06-18 11:29 Mikael Pettersson
  2007-06-18 11:33 ` Tejun Heo
@ 2007-06-18 11:40 ` Brad Campbell
  1 sibling, 0 replies; 6+ messages in thread
From: Brad Campbell @ 2007-06-18 11:40 UTC
  To: Mikael Pettersson; +Cc: htejun, linux-ide, linux-raid

Mikael Pettersson wrote:

>>> I don't think sata_promise is the guilty party here. Looks like some
>>> layer above sata_promise got confused about the state of the interface.
>> But locking up hard after hardreset is a problem of sata_promise, no?
> 
> Maybe, maybe not. The original report doesn't specify where/how
> the machine hung.

It hangs while powering off, after I unmount everything and halt the machine.

I've tried halt with and without the -h.

With the -h you can hear the drives spin down, then it tries to spin them up again and hangs.

Without the -h it just hangs hard where you see in the photo.

> Brad: can you enable sysrq and check if the kernel responds to
> sysrq when it appears to hang, and if so, where it's executing?

All my kernels have sysrq enabled. Once the hard reset is displayed on the screen everything locks.

> sata_promise just passes sata_std_hardreset to ata_do_eh.
> I've certainly seen EH hardresets work before, so I'm assuming
> that something in this particular situation (PHY offlined,
> kernel close to shutting down) breaks things.

That is my thought too. I understood that on a .22-rc kernel, if I used halt -h and it spun the
disks down, the kernel would detect that and not try to flush the caches on them. Or have I
misread something?

> FWIW, I'm seeing scsi layer accesses (cache flushes) after things
> like rmmod sata_promise. They error out and don't seem to cause
> any harm, but the fact that they occur at all makes me nervous.

Brad
-- 
"Human beings, who are almost unique in having the ability
to learn from the experience of others, are also remarkable
for their apparent disinclination to do so." -- Douglas Adams

