Crash with Z77 chipset

linux-ide.vger.kernel.org archive mirror

* Crash with Z77 chipset
@ 2012-12-17 17:07 Andrius Narbutas
  2012-12-18  3:41 ` Robert Hancock
  0 siblings, 1 reply; 4+ messages in thread
From: Andrius Narbutas @ 2012-12-17 17:07 UTC
  To: linux-ide

[-- Attachment #1: Type: text/plain, Size: 3634 bytes --]

Hello,
(probably a bit long mail, but i will try to describe what i did or tried)
using ASRock Z77 Pro3 motherboard with Z77 chipset, 4xSATA WDC 
WD1002FAEX-00Z3A0 drives, Debian Linux (basic installation, no X or 
other services).
Problem: any intense disk I/O causes the system to crash. The easiest way
(for me) to reproduce the problem (100% so far) is to run mkfs.ext2
/dev/sdb3 (any filesystem will work; the same goes for `dd if=/dev/zero
of=/dev/sdb bs=1M`, just a bit slower). Before the crash, inode creation
slows down for ~10 seconds, then stops completely (and the crash follows
immediately).
What i tried:
  - first i noticed that the system crashes with the default Debian kernel
(2.6.32-5-amd64). This is the only kernel that writes anything to the
message log; it crashes while writing inodes at count ~3250/7464. It
writes info to /var/log/messages and the console, then the system becomes
unresponsive (kernel.panic from sysctl does not reboot the system, same
goes for the software watchdog - you need to "manually" reboot the system)
  - i compiled the current stable kernel (3.6.10) with
CONFIG_DETECT_HUNG_TASK=y and CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y and
re-tested. The system hangs when writing inodes at ~3450/7464, no info on
screen or in syslog. The system can be rebooted with `echo b >
/proc/sysrq-trigger` from another console; the console is responsive, but
any disk access will hang it. Sometimes (rarely) the system becomes
unresponsive and reboots after a timeout
  - i compiled today's git kernel with the same parameters.
It hangs at ~6400/7464 (note - it gets much further than the previous
versions), but completely - it does not reboot itself, does not respond to
ping, only a power-off helps. Nothing in syslog; a photo of the screen will
be attached with the logs in the next post (it can't be scrolled up/down -
so no info about what happened earlier)
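
A less destructive way to generate the same kind of write load, and to see the slowdown before the hang, might look like the sketch below. It is only an illustration and has not been tested on this board: the function name `write_load`, its parameters and the slow-chunk threshold are made up for this example, and it writes to a regular file rather than to the raw device.

```python
import os
import time


def write_load(path, total_mib=64, chunk_mib=1, slow_factor=5.0):
    """Write zeros to `path` in fsync'ed chunks and time every chunk.

    Returns a list of (chunk_index, seconds, slow) tuples. A chunk is
    flagged slow when it takes more than `slow_factor` times the median
    duration of the chunks written before it (hypothetical threshold).
    """
    chunk = b"\0" * (chunk_mib * 1024 * 1024)
    durations = []
    results = []
    with open(path, "wb") as f:
        for i in range(total_mib // chunk_mib):
            start = time.monotonic()
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # push the data to the device, not just the page cache
            elapsed = time.monotonic() - start

            # Median of the earlier chunks serves as the baseline.
            baseline = sorted(durations)[len(durations) // 2] if durations else None
            slow = baseline is not None and elapsed > slow_factor * baseline
            durations.append(elapsed)
            results.append((i, elapsed, slow))

            # Flush stdout at once so the last line survives a hung disk.
            print(f"chunk {i:6d} {elapsed:9.4f}s{'  SLOW' if slow else ''}", flush=True)
    return results
```

Run over ssh against a file on the affected disk, for example `write_load("/mnt/test/ioload.bin", total_mib=4096)` (the path is hypothetical). Every line is printed as soon as its chunk completes, so the last line shows how far the write got before the disks stopped responding.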

Observations:
  - the system can stay "alive" and working with low disk activity for a
long time (at least more than a week). But some disk I/O is enough to
crash it (for example, copying a bzip'ed kernel image from one place to
another is enough to trigger the crash)
  - disk type does not matter. I tried attaching a Hitachi HDS722020ALA330
disk instead of the WD - same result (i would say it crashed even earlier,
but i didn't measure exactly)
  - SATA cables were replaced, and the system can run the prime95 torture
test for several hours - so i would say that RAM/CPU isn't the problem here
  - it can be crashed with activity on any disk. I tried to make RAID10
with LVM on top - crash; disassembled the md array, tested with disk
activity on _all_ disks separately - any disk activity can crash the system
  - tested all the "quick" solutions i could find on the internet, including
kernel/module params "acpi=off noapic", "libata.noacpi=1",
"libata.force=1.5Gbps", and some other voodoo magic like disabling the
write cache or disabling NCQ - no difference (i probably tested something
more, like 'norst', i forgot already)
Attached zip'ed logs - one from 2.6 kernel (with trace), another from 
today's git kernel (entire log from boot to crash, next line in log 
starts again with rsyslog...).
Also, screen images from "dead" system (nothing in logs, and i can't 
scroll up):
  - today's git kernel: http://i49.tinypic.com/js0xl2.jpg
  - 3.6.10 on shutdown (crashed): http://i47.tinypic.com/2exv4fr.jpg

Because this problem is easily reproducible, i can try to get as much
information as i can, if you ask. Minor problem - i do not have physical
access to the system, so if tests should be done with the latest kernel
(which hangs completely and needs access to the system for a restart) - i
can do tests only during the day, when others can access and reboot it.

Thanks.


[-- Attachment #2: logs.zip --]
[-- Type: application/zip, Size: 11714 bytes --]


* Re: Crash with Z77 chipset
  2012-12-17 17:07 Crash with Z77 chipset Andrius Narbutas
@ 2012-12-18  3:41 ` Robert Hancock
  2012-12-18  8:51   ` Andrius Narbutas
  0 siblings, 1 reply; 4+ messages in thread
From: Robert Hancock @ 2012-12-18  3:41 UTC
  To: Andrius Narbutas; +Cc: linux-ide

On 12/17/2012 11:07 AM, Andrius Narbutas wrote:
> Hello,
> (probably a bit long mail, but i will try to describe what i did or tried)
> using ASRock Z77 Pro3 motherboard with Z77 chipset, 4xSATA WDC
> WD1002FAEX-00Z3A0 drives, Debian Linux (basic installation, no X or
> other services).
> Problem: any intense disk I/O causes the system to crash. The easiest way
> (for me) to reproduce the problem (100% so far) is to run mkfs.ext2
> /dev/sdb3 (any filesystem will work; the same goes for `dd if=/dev/zero
> of=/dev/sdb bs=1M`, just a bit slower). Before the crash, inode creation
> slows down for ~10 seconds, then stops completely (and the crash follows
> immediately).
> What i tried:

My first thought would be that a power problem is a possibility. These 
kinds of setups with multiple HDs in a RAID setup are known to cause 
these issues in some cases if the PSU isn't adequate. It tends to show 
up in situations like this where all hard drives are maxed out with disk 
activity and they all pull their peak power at the same time - if the 
voltage dips too low you can get problems with the SATA link dropping, etc.

You might want to try running with only one or two disks powered up, or 
try moving disks to different power cables, etc. to see if that affects 
the problem.
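
One way to check whether the SATA link is actually dropping would be to count link-related libata messages per port in whatever kernel log survives. The sketch below is only an illustration: the function name `count_link_events` is made up, and the message fragments it looks for are strings libata is believed to print; the list is not exhaustive and the exact wording can differ between kernel versions.

```python
import re
from collections import Counter

# Fragments of libata messages that usually accompany link trouble.
# Assumed wording; adjust to what the running kernel actually prints.
LINK_EVENT = re.compile(
    r"\b(ata\d+)(?:\.\d+)?: .*?"
    r"(hard resetting link|SATA link down|link is slow to respond|exception Emask)"
)


def count_link_events(log_text):
    """Return a Counter keyed by (port, event) for each matching log line."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINK_EVENT.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

For example, `count_link_events(open("/var/log/messages").read())` gives a per-port tally. If the resets follow a particular power cable rather than a particular SATA port when the disks are moved around, that would point toward power rather than the controller.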



* Re: Crash with Z77 chipset
  2012-12-18  3:41 ` Robert Hancock
@ 2012-12-18  8:51   ` Andrius Narbutas
  2012-12-19  3:36     ` Robert Hancock
  0 siblings, 1 reply; 4+ messages in thread
From: Andrius Narbutas @ 2012-12-18  8:51 UTC
  To: linux-ide

On 2012.12.18 05:41, Robert Hancock wrote:
> My first thought would be that a power problem is a possibility. These
> kinds of setups with multiple HDs in a RAID setup are known to cause
> these issues in some cases if the PSU isn't adequate.

I do not think the PSU is the problem, because:
1) All hard disks combined draw less additional power than a loaded CPU,
even at heavy load (from the HDD datasheet: "Read/Write: 6.80 Watts; Idle:
6.10 Watts" - the difference is 0.7W per HDD, so < 3W combined; the CPU
draws ~40W more when loaded, compared to idle). Loading CPU/RAM to the max
does not crash the system at all
2) I plan power supplies at 2x the needed power (you know, all those
"Chinese Watt" ratings are unreliable). Anyway, it should be more than
enough for the whole system (and the CPU is almost idle when creating a
filesystem, so the load on the PSU is very low - should be < 70W - that's
almost nothing for a 560W PSU, even counting the "Chinese Watt" coefficient)
3) If the PSU is at fault - why does it fail at exactly the same place?
Most hardware failures have some "random" factor - you get segfaults at
random places from faulty RAM, crashes from a dying PSU when doing random
tasks... But now it fails at exactly the same place (when using the same
kernel)
4) Let's say the PSU is faulty. Then how come that with the 3.6.10 kernel i
still have control over the system (when it crashes) - so only the disk
subsystem fails? Because it has only one 12V rail, you cannot
disconnect the disks from the system without killing motherboard power
too. But after a crash i can still do `ssh root@deadhost 'echo b >
/proc/sysrq-trigger'` - so the system is alive and working well (just the
disks are dead)
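
The arithmetic behind points 1 and 2 can be written out as a small sketch. All figures are the ones quoted above; the variable and function names are made up for illustration, and the 40W CPU figure is read as the loaded-versus-idle difference. Note that these are average figures, so the calculation says nothing about short-duration peaks.

```python
# Figures quoted in this message (HDD datasheet and my own estimates).
DISKS = 4
HDD_RW_W = 6.80          # datasheet: read/write
HDD_IDLE_W = 6.10        # datasheet: idle
CPU_LOAD_DELTA_W = 40.0  # estimate: loaded vs. idle (assumed reading of "~40W")
SYSTEM_LOAD_W = 70.0     # estimated upper bound while creating a filesystem
PSU_RATED_W = 560.0


def disk_delta_w(disks=DISKS):
    """Extra average power drawn by all disks when busy instead of idle."""
    return disks * (HDD_RW_W - HDD_IDLE_W)


def psu_utilisation(load_w=SYSTEM_LOAD_W, rated_w=PSU_RATED_W):
    """Fraction of the PSU's rated power used by the estimated load."""
    return load_w / rated_w


print(f"disks busy vs idle: {disk_delta_w():.2f} W")            # 2.80 W
print(f"CPU loaded vs idle: {CPU_LOAD_DELTA_W:.2f} W")          # 40.00 W
print(f"PSU utilisation:    {psu_utilisation() * 100:.1f} %")   # 12.5 %
```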

I could imagine that motherboard itself is faulty (well, interesting 
anyway - why it fails only on heavy I/O load), so i will try to get 
Windows Server installed to check if that will work.


* Re: Crash with Z77 chipset
  2012-12-18  8:51   ` Andrius Narbutas
@ 2012-12-19  3:36     ` Robert Hancock
  0 siblings, 0 replies; 4+ messages in thread
From: Robert Hancock @ 2012-12-19  3:36 UTC
  To: Andrius Narbutas; +Cc: linux-ide

On 12/18/2012 02:51 AM, Andrius Narbutas wrote:
> On 2012.12.18 05:41, Robert Hancock wrote:
>> My first thought would be that a power problem is a possibility. These
>> kinds of setups with multiple HDs in a RAID setup are known to cause
>> these issues in some cases if the PSU isn't adequate.
>
> I do not think the PSU is the problem, because:
> 1) All hard disks combined draw less additional power than a loaded CPU,
> even at heavy load (from the HDD datasheet: "Read/Write: 6.80 Watts; Idle:
> 6.10 Watts" - the difference is 0.7W per HDD, so < 3W combined; the CPU
> draws ~40W more when loaded, compared to idle). Loading CPU/RAM to the max
> does not crash the system at all

The CPU has a voltage regulator in front of it which can compensate for 
dips in the input voltage. The disks don't. The wattage figures don't 
necessarily account for short-duration power draw peaks. And depending 
on how the drives are hooked up, especially if they are all on one 
cable, they can potentially see a problematic voltage drop.


