Western Digital Scorpio and ICH10R on Debian

* Western Digital Scorpio and ICH10R on Debian - NCQ issue?
@ 2011-07-12 16:21 Sandra Escandor
  2011-07-16  1:16 ` Robert Hancock
  0 siblings, 1 reply; 6+ messages in thread
From: Sandra Escandor @ 2011-07-12 16:21 UTC
  To: linux-ide

The situation:
A failed WRITE FPDMA QUEUED command appears to cause driver timeouts,
which in turn lock up the RAID array (it previously worked pretty
well). This happened under heavy I/O.

The questions:
1. Is it a good idea to turn off NCQ? I've read in different posts that
it helps in some cases but not in others. I'm in the process of setting
up an experimental box, but I wanted to confirm first whether this is a
good idea.
2. Are there known issues with the ICH10R + WD7500BPKT-00PK4T0
combination and the libata driver?
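
For the experimental box, one way to take NCQ out of the picture at runtime is to drop the device queue depth to 1. The sketch below is only an illustration and assumes the standard sysfs attribute /sys/block/<dev>/device/queue_depth that libata-driven disks normally expose; the helper names and the sysfs_root parameter are made up here, and the change does not survive a reboot. The libata.force=noncq boot parameter is the other route I am aware of.

```python
from pathlib import Path


def queue_depth_path(dev, sysfs_root="/sys/block"):
    """Path of the queue_depth attribute for a block device such as 'sdc'."""
    return Path(sysfs_root) / dev / "device" / "queue_depth"


def get_queue_depth(dev, sysfs_root="/sys/block"):
    """Read the current queue depth (1 means only one command in flight)."""
    return int(queue_depth_path(dev, sysfs_root).read_text().strip())


def set_queue_depth(dev, depth, sysfs_root="/sys/block"):
    """Write a new queue depth and return the value read back.

    A depth of 1 leaves a single outstanding command, so NCQ is
    effectively not used. Writing to the real sysfs file needs root.
    """
    if not 1 <= depth <= 32:
        raise ValueError("queue depth must be between 1 and 32")
    queue_depth_path(dev, sysfs_root).write_text(f"{depth}\n")
    return get_queue_depth(dev, sysfs_root)
```

Run as root, set_queue_depth("sdc", 1) would apply this to the drive that shows up as sdc in the logs below; the same call would be needed for each array member.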

The System:
Four WDC WD7500BPKT-00PK4T0 drives (Western Digital Scorpio) in a RAID10
array created using mdadm 3.1.4
ICH10R SATA controller
Kernel 2.6.32-5-amd64


The relevant kernel logs:

Jul  8 14:48:06 ecs-1u kernel: [ 8200.901003] ata3.00: exception Emask 0x0 SAct 0x1ffc0 SErr 0x0 action 0x6 frozen
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901052] ata3.00: failed command: WRITE FPDMA QUEUED
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901082] ata3.00: cmd 61/00:30:80:37:3f/04:00:44:00:00/40 tag 6 ncq 524288 out
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901083]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901163] ata3.00: status: { DRDY }
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901183] ata3.00: failed command: WRITE FPDMA QUEUED
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901207] ata3.00: cmd 61/00:38:80:3b:3f/04:00:44:00:00/40 tag 7 ncq 524288 out
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901208]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901282] ata3.00: status: { DRDY }
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901302] ata3.00: failed command: WRITE FPDMA QUEUED
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901326] ata3.00: cmd 61/00:40:80:3f:3f/04:00:44:00:00/40 tag 8 ncq 524288 out
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901327]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901400] ata3.00: status: { DRDY }
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901420] ata3.00: failed command: WRITE FPDMA QUEUED
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901444] ata3.00: cmd 61/00:48:80:43:3f/04:00:44:00:00/40 tag 9 ncq 524288 out
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901445]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901525] ata3.00: status: { DRDY }
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901545] ata3.00: failed command: WRITE FPDMA QUEUED
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901569] ata3.00: cmd 61/00:50:80:47:3f/04:00:44:00:00/40 tag 10 ncq 524288 out
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901570]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901644] ata3.00: status: { DRDY }
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901664] ata3.00: failed command: WRITE FPDMA QUEUED
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901688] ata3.00: cmd 61/00:58:80:4b:3f/04:00:44:00:00/40 tag 11 ncq 524288 out
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901689]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901763] ata3.00: status: { DRDY }
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901783] ata3.00: failed command: WRITE FPDMA QUEUED
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901807] ata3.00: cmd 61/00:60:80:4f:3f/04:00:44:00:00/40 tag 12 ncq 524288 out
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901808]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901882] ata3.00: status: { DRDY }
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901902] ata3.00: failed command: WRITE FPDMA QUEUED
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901926] ata3.00: cmd 61/00:68:80:53:3f/04:00:44:00:00/40 tag 13 ncq 524288 out
Jul  8 14:48:06 ecs-1u kernel: [ 8200.901927]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul  8 14:48:06 ecs-1u kernel: [ 8200.902000] ata3.00: status: { DRDY }
Jul  8 14:48:06 ecs-1u kernel: [ 8200.902020] ata3.00: failed command: WRITE FPDMA QUEUED
Jul  8 14:48:06 ecs-1u kernel: [ 8200.902044] ata3.00: cmd 61/00:70:80:57:3f/04:00:44:00:00/40 tag 14 ncq 524288 out
Jul  8 14:48:06 ecs-1u kernel: [ 8200.902045]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul  8 14:48:06 ecs-1u kernel: [ 8200.902119] ata3.00: status: { DRDY }
Jul  8 14:48:06 ecs-1u kernel: [ 8200.902139] ata3.00: failed command: WRITE FPDMA QUEUED
Jul  8 14:48:06 ecs-1u kernel: [ 8200.902163] ata3.00: cmd 61/00:78:80:5b:3f/04:00:44:00:00/40 tag 15 ncq 524288 out
Jul  8 14:48:06 ecs-1u kernel: [ 8200.902164]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul  8 14:48:06 ecs-1u kernel: [ 8200.902238] ata3.00: status: { DRDY }
Jul  8 14:48:06 ecs-1u kernel: [ 8200.902257] ata3.00: failed command: WRITE FPDMA QUEUED
Jul  8 14:48:06 ecs-1u kernel: [ 8200.902281] ata3.00: cmd 61/10:80:70:ef:37/00:00:26:00:00/40 tag 16 ncq 8192 out
Jul  8 14:48:06 ecs-1u kernel: [ 8200.902282]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul  8 14:48:06 ecs-1u kernel: [ 8200.902356] ata3.00: status: { DRDY }
Jul  8 14:48:06 ecs-1u kernel: [ 8200.902378] ata3: hard resetting link
Jul  8 14:48:11 ecs-1u kernel: [ 8206.257532] ata3: link is slow to respond, please be patient (ready=0)
Jul  8 14:48:16 ecs-1u kernel: [ 8210.902508] ata3: COMRESET failed (errno=-16)
Jul  8 14:48:16 ecs-1u kernel: [ 8210.902535] ata3: hard resetting link
Jul  8 14:48:21 ecs-1u kernel: [ 8216.259007] ata3: link is slow to respond, please be patient (ready=0)
Jul  8 14:48:21 ecs-1u kernel: [ 8216.762685] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jul  8 14:48:21 ecs-1u kernel: [ 8216.769012] ata3.00: configured for UDMA/133
Jul  8 14:48:21 ecs-1u kernel: [ 8216.769019] ata3.00: device reported invalid CHS sector 0
Jul  8 14:48:21 ecs-1u kernel: [ 8216.769024] ata3.00: device reported invalid CHS sector 0
Jul  8 14:48:21 ecs-1u kernel: [ 8216.769028] ata3.00: device reported invalid CHS sector 0
Jul  8 14:48:21 ecs-1u kernel: [ 8216.769032] ata3.00: device reported invalid CHS sector 0
Jul  8 14:48:21 ecs-1u kernel: [ 8216.769036] ata3.00: device reported invalid CHS sector 0
Jul  8 14:48:21 ecs-1u kernel: [ 8216.769041] ata3.00: device reported invalid CHS sector 0
Jul  8 14:48:21 ecs-1u kernel: [ 8216.769045] ata3.00: device reported invalid CHS sector 0
Jul  8 14:48:21 ecs-1u kernel: [ 8216.769049] ata3.00: device reported invalid CHS sector 0
Jul  8 14:48:21 ecs-1u kernel: [ 8216.769054] ata3.00: device reported invalid CHS sector 0
Jul  8 14:48:21 ecs-1u kernel: [ 8216.769058] ata3.00: device reported invalid CHS sector 0
Jul  8 14:48:21 ecs-1u kernel: [ 8216.769060] ata3.00: device reported invalid CHS sector 0
Jul  8 14:48:21 ecs-1u kernel: [ 8216.769078] ata3: EH complete
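
As a cross-check on the block above, the SAct mask and the cmd lines can be decoded by hand. The sketch below assumes libata's usual print order for the taskfile (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and 512-byte logical sectors; the function names are made up for illustration.

```python
def active_tags(sact):
    """Return the NCQ tags whose bits are set in the SActive mask."""
    return [tag for tag in range(32) if (sact >> tag) & 1]


def decode_ncq_taskfile(taskfile):
    """Decode the 'cmd' field of a libata error for an FPDMA QUEUED command.

    Assumed field order:
    cmd/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    """
    groups = [[int(byte, 16) for byte in group.split(":")]
              for group in taskfile.split("/")]
    command = groups[0][0]
    feature, nsect, lbal, lbam, lbah = groups[1]
    hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = groups[2]
    # For FPDMA commands the sector count travels in the feature registers
    # and the tag sits in bits 7:3 of the sector-count register.
    sectors = (hob_feature << 8) | feature
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)
    return {
        "command": command,
        "tag": nsect >> 3,
        "sectors": sectors,
        "bytes": sectors * 512,  # assumes 512-byte logical sectors
        "lba": lba,
    }
```

Under these assumptions, SAct 0x1ffc0 corresponds to tags 6 through 16, which are the eleven commands reported as failed, and the first cmd line decodes to a 524288-byte write with tag 6, matching the "tag 6 ncq 524288" text that the kernel prints next to it.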

Jul  8 14:57:19 ecs-1u kernel: [ 8753.699973] sd 2:0:0:0: [sdc] Unhandled error code
Jul  8 14:57:19 ecs-1u kernel: [ 8753.699975] sd 2:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul  8 14:57:19 ecs-1u kernel: [ 8753.699977] sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 3e cf 30 00 00 03 68 00
Jul  8 14:57:19 ecs-1u kernel: [ 8753.699982] end_request: I/O error, dev sdc, sector 1053765632
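
The CDB bytes and the sector in the end_request line can be tied together. A minimal sketch, assuming the standard SCSI WRITE(10) layout (opcode 0x2a, big-endian LBA in bytes 2 to 5, big-endian transfer length in bytes 7 and 8); decode_write10 is a hypothetical helper name.

```python
def decode_write10(cdb_hex):
    """Return (lba, blocks) for a WRITE(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # transfer length in blocks
    return lba, blocks
```

For the CDB above, decode_write10("2a 00 3e cf 30 00 00 03 68 00") gives LBA 1053765632, the same sector that end_request reports, with a transfer length of 872 blocks.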

Jul  8 14:57:23 ecs-1u kernel: [ 8758.163655] md: recovery of RAID array md126
Jul  8 14:57:23 ecs-1u kernel: [ 8758.163660] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Jul  8 14:57:23 ecs-1u kernel: [ 8758.163662] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Jul  8 14:57:23 ecs-1u kernel: [ 8758.163672] md: using 128k window, over a total of 732572288 blocks.
Jul  8 14:57:23 ecs-1u kernel: [ 8758.163675] md: resuming recovery of md126 from checkpoint.
Jul  8 14:57:23 ecs-1u kernel: [ 8758.163677] md: md126: recovery done.
Jul  8 14:57:23 ecs-1u kernel: [ 8758.296414] RAID10 conf printout:
Jul  8 14:57:23 ecs-1u kernel: [ 8758.296416]  --- wd:3 rd:4
Jul  8 14:57:23 ecs-1u kernel: [ 8758.296417]  disk 0, wo:0, o:1, dev:sdb
Jul  8 14:57:23 ecs-1u kernel: [ 8758.296419]  disk 1, wo:1, o:0, dev:sdc
Jul  8 14:57:23 ecs-1u kernel: [ 8758.296420]  disk 2, wo:0, o:1, dev:sdd
Jul  8 14:57:23 ecs-1u kernel: [ 8758.296421]  disk 3, wo:0, o:1, dev:sde

Jul  8 14:58:17 ecs-1u kernel: [ 8812.088705] sd 2:0:0:0: [sdc] Unhandled error code
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088710] sd 2:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088714] sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 3e cf 63 00 00 04 00 00
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088723] end_request: I/O error, dev sdc, sector 1053778688
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088775] sd 2:0:0:0: [sdc] Unhandled error code
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088776] sd 2:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088778] sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 3e cf 67 00 00 04 00 00
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088781] end_request: I/O error, dev sdc, sector 1053779712
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088817] sd 2:0:0:0: [sdc] Unhandled error code
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088818] sd 2:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088820] sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 3e cf 6b 00 00 04 00 00
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088823] end_request: I/O error, dev sdc, sector 1053780736
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088859] sd 2:0:0:0: [sdc] Unhandled error code
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088860] sd 2:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088862] sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 3e cf 6f 00 00 04 00 00
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088865] end_request: I/O error, dev sdc, sector 1053781760
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088909] sd 2:0:0:0: [sdc] Unhandled error code
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088910] sd 2:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088912] sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 3e cf 73 00 00 04 00 00
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088916] end_request: I/O error, dev sdc, sector 1053782784
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089014] sd 2:0:0:0: [sdc] Unhandled error code
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089015] sd 2:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089017] sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 3e cf 77 00 00 04 00 00
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089020] end_request: I/O error, dev sdc, sector 1053783808
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089121] sd 2:0:0:0: [sdc] Unhandled error code
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089122] sd 2:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089124] sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 3e cf 7b 00 00 04 00 00
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089127] end_request: I/O error, dev sdc, sector 1053784832
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089236] sd 2:0:0:0: [sdc] Unhandled error code
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089237] sd 2:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089239] sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 3e cf 7f 00 00 04 00 00
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089243] end_request: I/O error, dev sdc, sector 1053785856
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089344] sd 2:0:0:0: [sdc] Unhandled error code
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089345] sd 2:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089347] sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 3e cf 83 00 00 04 00 00
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089351] end_request: I/O error, dev sdc, sector 1053786880
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089441] sd 2:0:0:0: [sdc] Unhandled error code
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089443] sd 2:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089444] sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 3e cf 87 00 00 04 00 00
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089448] end_request: I/O error, dev sdc, sector 1053787904
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089536] sd 2:0:0:0: [sdc] Unhandled error code
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089537] sd 2:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089538] sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 3e cf 8b 00 00 04 00 00
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089542] end_request: I/O error, dev sdc, sector 1053788928
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089631] sd 2:0:0:0: [sdc] Unhandled error code
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089632] sd 2:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089634] sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 3e cf 8f 00 00 04 00 00
Jul  8 14:58:17 ecs-1u kernel: [ 8812.089637] end_request: I/O error, dev sdc, sector 1053789952
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041839] INFO: task kthreadd:2 blocked for more than 120 seconds.
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041905] kthreadd      D 0000000000000000     0     2      0 0x00000000
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041908]  ffff8801bf13aa60 0000000000000046 0000000000000000 ffff8801bf11d000
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041911]  0000000000000400 0000000000003737 000000000000f9e0 ffff8801bf067fd8
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041913]  0000000000015780 0000000000015780 ffff88033f028710 ffff88033f028a08
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041915] Call Trace:
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041925]  [<ffffffff810b41ed>] ?
sync_page+0x0/0x46
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041929]  [<ffffffff812fb0d2>] ?
io_schedule+0x73/0xb7
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041931]  [<ffffffff810b422e>] ?
sync_page+0x41/0x46
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041933]  [<ffffffff812fb5df>] ?
__wait_on_bit+0x41/0x70
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041935]  [<ffffffff810b43b2>] ?
wait_on_page_bit+0x6b/0x71
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041938]  [<ffffffff81064f38>] ?
wake_bit_function+0x0/0x23
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041943]  [<ffffffff810be14a>] ?
shrink_page_list+0x14e/0x623
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041948]  [<ffffffff8105a8e1>] ?
del_timer_sync+0xc/0x16
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041953]  [<ffffffff8101657d>] ?
read_tsc+0xa/0x20
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041955]  [<ffffffff812fb434>] ?
schedule_timeout+0xad/0xdd
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041958]  [<ffffffff8106c477>] ?
ktime_get_ts+0x68/0xb2
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041961]  [<ffffffff81099d36>] ?
delayacct_end+0x74/0x7f
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041963]  [<ffffffff810bd53b>] ?
isolate_pages_global+0x1a0/0x20f
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041965]  [<ffffffff81065009>] ?
finish_wait+0x35/0x60
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041967]  [<ffffffff81064f0a>] ?
autoremove_wake_function+0x0/0x2e
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041969]  [<ffffffff810bee20>] ?
shrink_list+0x528/0x767
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041971]  [<ffffffff810bf2df>] ?
shrink_zone+0x280/0x342
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041975]  [<ffffffff810c76e8>] ?
zone_statistics+0x3c/0x5d
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041977]  [<ffffffff810b8593>] ?
zone_watermark_ok+0x20/0xb1
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041979]  [<ffffffff810bf76a>] ?
zone_reclaim+0x276/0x357
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041981]  [<ffffffff810bd39b>] ?
isolate_pages_global+0x0/0x20f
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041983]  [<ffffffff810b8593>] ?
zone_watermark_ok+0x20/0xb1
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041985]  [<ffffffff810b98bc>] ?
get_page_from_freelist+0x1ff/0x760
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041987]  [<ffffffff810ba184>] ?
__alloc_pages_nodemask+0x11c/0x5f4
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041994]  [<ffffffff8118e316>] ?
cpumask_next_and+0x2a/0x3a
Jul  8 15:01:22 ecs-1u kernel: [ 8997.041998]  [<ffffffff810453c3>] ?
find_busiest_group+0x9ae/0xa1e
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042001]  [<ffffffff81062afe>] ?
alloc_pid+0x26e/0x390
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042003]  [<ffffffff810b95c0>] ?
__get_free_pages+0x9/0x46
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042005]  [<ffffffff8104c506>] ?
copy_process+0xd7/0x115f
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042007]  [<ffffffff8104d6e5>] ?
do_fork+0x157/0x31e
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042009]  [<ffffffff81048261>] ?
finish_task_switch+0x3a/0xaf
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042012]  [<ffffffff81011b42>] ?
kernel_thread+0x82/0xe0
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042014]  [<ffffffff81064bc4>] ?
kthread+0x0/0x81
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042015]  [<ffffffff81011ba0>] ?
child_rip+0x0/0x20
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042017]  [<ffffffff81064b89>] ?
kthreadd+0xb1/0xec
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042021]  [<ffffffff814f5140>] ?
early_idt_handler+0x0/0x71
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042022]  [<ffffffff81011baa>] ?
child_rip+0xa/0x20
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042024]  [<ffffffff814f5140>] ?
early_idt_handler+0x0/0x71
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042028]  [<ffffffff810e01b1>] ?
do_set_mempolicy+0x128/0x13a
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042029]  [<ffffffff81064ad8>] ?
kthreadd+0x0/0xec
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042031]  [<ffffffff81011ba0>] ?
child_rip+0x0/0x20
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042076] INFO: task md126_raid10:3493 blocked for more than 120 seconds.
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042101] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042138] md126_raid10  D 0000000000000000     0  3493      2 0x00000000
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042140]  ffff88033f02b880 0000000000000046 0000000000000000 0000000a00000006
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042143]  0000006cffffffff ffff880006e0fa98 000000000000f9e0 ffff88033df07fd8
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042145]  0000000000015780 0000000000015780 ffff88033e79aa60 ffff88033e79ad58
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042147] Call Trace:
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042150]  [<ffffffff811951d6>] ?
sprintf+0x51/0x59
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042152]  [<ffffffff810414f5>] ?
select_task_rq_fair+0x472/0x836
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042154]  [<ffffffff812fb3b5>] ?
schedule_timeout+0x2e/0xdd
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042156]  [<ffffffff812fb26c>] ?
wait_for_common+0xde/0x15b
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042158]  [<ffffffff8104a440>] ?
default_wake_function+0x0/0x9
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042163]  [<ffffffff81064d7a>] ?
kthread_create+0x93/0x121
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042167]  [<ffffffffa0168764>] ?
md_thread+0x0/0x10f [md_mod]
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042172]  [<ffffffff810e7fb9>] ?
__kmalloc+0x12f/0x141
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042175]  [<ffffffffa01686ba>] ?
md_register_thread+0x22/0xcc [md_mod]
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042178]  [<ffffffffa0167510>] ?
md_do_sync+0x0/0xaf6 [md_mod]
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042181]  [<ffffffffa016872e>] ?
md_register_thread+0x96/0xcc [md_mod]
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042184]  [<ffffffffa016aee2>] ?
md_check_recovery+0x3fd/0x4b9 [md_mod]
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042187]  [<ffffffffa018116c>] ?
flush_pending_writes+0x13/0x8a [raid10]
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042190]  [<ffffffffa0181397>] ?
raid10d+0x42/0xade [raid10]
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042191]  [<ffffffff812faff8>] ?
thread_return+0x79/0xe0
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042194]  [<ffffffff8101166e>] ?
apic_timer_interrupt+0xe/0x20
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042196]  [<ffffffff812fb055>] ?
thread_return+0xd6/0xe0
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042197]  [<ffffffff812fb3b5>] ?
schedule_timeout+0x2e/0xdd
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042200]  [<ffffffffa0168855>] ?
md_thread+0xf1/0x10f [md_mod]
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042202]  [<ffffffff81064f0a>] ?
autoremove_wake_function+0x0/0x2e
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042205]  [<ffffffffa0168764>] ?
md_thread+0x0/0x10f [md_mod]
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042206]  [<ffffffff81064c3d>] ?
kthread+0x79/0x81
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042208]  [<ffffffff81011baa>] ?
child_rip+0xa/0x20
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042210]  [<ffffffff81064bc4>] ?
kthread+0x0/0x81
Jul  8 15:01:22 ecs-1u kernel: [ 8997.042211]  [<ffffffff81011ba0>] ?
child_rip+0x0/0x20
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963652] INFO: task kthreadd:2 blocked for more than 120 seconds.
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963680] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963718] kthreadd      D 0000000000000000     0     2      0 0x00000000
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963721]  ffff8801bf13aa60 0000000000000046 0000000000000000 ffff8801bf11d000
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963723]  0000000000000400 0000000000003737 000000000000f9e0 ffff8801bf067fd8
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963726]  0000000000015780 0000000000015780 ffff88033f028710 ffff88033f028a08
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963728] Call Trace:
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963737]  [<ffffffff810b41ed>] ?
sync_page+0x0/0x46
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963742]  [<ffffffff812fb0d2>] ?
io_schedule+0x73/0xb7
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963744]  [<ffffffff810b422e>] ?
sync_page+0x41/0x46
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963746]  [<ffffffff812fb5df>] ?
__wait_on_bit+0x41/0x70
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963748]  [<ffffffff810b43b2>] ?
wait_on_page_bit+0x6b/0x71
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963752]  [<ffffffff81064f38>] ?
wake_bit_function+0x0/0x23
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963755]  [<ffffffff810be14a>] ?
shrink_page_list+0x14e/0x623
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963760]  [<ffffffff8105a8e1>] ?
del_timer_sync+0xc/0x16
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963765]  [<ffffffff8101657d>] ?
read_tsc+0xa/0x20
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963766]  [<ffffffff812fb434>] ?
schedule_timeout+0xad/0xdd
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963769]  [<ffffffff8106c477>] ?
ktime_get_ts+0x68/0xb2
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963772]  [<ffffffff81099d36>] ?
delayacct_end+0x74/0x7f
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963774]  [<ffffffff810bd53b>] ?
isolate_pages_global+0x1a0/0x20f
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963776]  [<ffffffff81065009>] ?
finish_wait+0x35/0x60
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963778]  [<ffffffff81064f0a>] ?
autoremove_wake_function+0x0/0x2e
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963780]  [<ffffffff810bee20>] ?
shrink_list+0x528/0x767
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963783]  [<ffffffff810bf2df>] ?
shrink_zone+0x280/0x342
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963786]  [<ffffffff810c76e8>] ?
zone_statistics+0x3c/0x5d
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963788]  [<ffffffff810b8593>] ?
zone_watermark_ok+0x20/0xb1
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963790]  [<ffffffff810bf76a>] ?
zone_reclaim+0x276/0x357
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963792]  [<ffffffff810bd39b>] ?
isolate_pages_global+0x0/0x20f
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963794]  [<ffffffff810b8593>] ?
zone_watermark_ok+0x20/0xb1
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963796]  [<ffffffff810b98bc>] ?
get_page_from_freelist+0x1ff/0x760
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963798]  [<ffffffff810ba184>] ?
__alloc_pages_nodemask+0x11c/0x5f4
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963804]  [<ffffffff8118e316>] ?
cpumask_next_and+0x2a/0x3a
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963808]  [<ffffffff810453c3>] ?
find_busiest_group+0x9ae/0xa1e
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963812]  [<ffffffff81062afe>] ?
alloc_pid+0x26e/0x390
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963813]  [<ffffffff810b95c0>] ?
__get_free_pages+0x9/0x46
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963816]  [<ffffffff8104c506>] ?
copy_process+0xd7/0x115f
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963818]  [<ffffffff8104d6e5>] ?
do_fork+0x157/0x31e
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963820]  [<ffffffff81048261>] ?
finish_task_switch+0x3a/0xaf
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963822]  [<ffffffff81011b42>] ?
kernel_thread+0x82/0xe0
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963824]  [<ffffffff81064bc4>] ?
kthread+0x0/0x81
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963825]  [<ffffffff81011ba0>] ?
child_rip+0x0/0x20
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963827]  [<ffffffff81064b89>] ?
kthreadd+0xb1/0xec
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963831]  [<ffffffff814f5140>] ?
early_idt_handler+0x0/0x71
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963833]  [<ffffffff81011baa>] ?
child_rip+0xa/0x20
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963835]  [<ffffffff814f5140>] ?
early_idt_handler+0x0/0x71
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963838]  [<ffffffff810e01b1>] ?
do_set_mempolicy+0x128/0x13a
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963840]  [<ffffffff81064ad8>] ?
kthreadd+0x0/0xec
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963842]  [<ffffffff81011ba0>] ?
child_rip+0x0/0x20
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963886] INFO: task md126_raid10:3493 blocked for more than 120 seconds.
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963911] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963949] md126_raid10  D 0000000000000000     0  3493      2 0x00000000
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963951]  ffff88033f02b880 0000000000000046 0000000000000000 0000000a00000006
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963953]  0000006cffffffff ffff880006e0fa98 000000000000f9e0 ffff88033df07fd8
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963955]  0000000000015780 0000000000015780 ffff88033e79aa60 ffff88033e79ad58
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963957] Call Trace:
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963961]  [<ffffffff811951d6>] ?
sprintf+0x51/0x59
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963963]  [<ffffffff810414f5>] ?
select_task_rq_fair+0x472/0x836
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963965]  [<ffffffff812fb3b5>] ?
schedule_timeout+0x2e/0xdd
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963967]  [<ffffffff812fb26c>] ?
wait_for_common+0xde/0x15b
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963969]  [<ffffffff8104a440>] ?
default_wake_function+0x0/0x9
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963973]  [<ffffffff81064d7a>] ?
kthread_create+0x93/0x121
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963977]  [<ffffffffa0168764>] ?
md_thread+0x0/0x10f [md_mod]
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963982]  [<ffffffff810e7fb9>] ?
__kmalloc+0x12f/0x141
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963985]  [<ffffffffa01686ba>] ?
md_register_thread+0x22/0xcc [md_mod]
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963988]  [<ffffffffa0167510>] ?
md_do_sync+0x0/0xaf6 [md_mod]
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963991]  [<ffffffffa016872e>] ?
md_register_thread+0x96/0xcc [md_mod]
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963994]  [<ffffffffa016aee2>] ?
md_check_recovery+0x3fd/0x4b9 [md_mod]
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963997]  [<ffffffffa018116c>] ?
flush_pending_writes+0x13/0x8a [raid10]
Jul  8 15:03:22 ecs-1u kernel: [ 9116.963999]  [<ffffffffa0181397>] ?
raid10d+0x42/0xade [raid10]
Jul  8 15:03:22 ecs-1u kernel: [ 9116.964001]  [<ffffffff812faff8>] ?
thread_return+0x79/0xe0
Jul  8 15:03:22 ecs-1u kernel: [ 9116.964003]  [<ffffffff8101166e>] ?
apic_timer_interrupt+0xe/0x20
Jul  8 15:03:22 ecs-1u kernel: [ 9116.964005]  [<ffffffff812fb055>] ?
thread_return+0xd6/0xe0
Jul  8 15:03:22 ecs-1u kernel: [ 9116.964007]  [<ffffffff812fb3b5>] ?
schedule_timeout+0x2e/0xdd
Jul  8 15:03:22 ecs-1u kernel: [ 9116.964010]  [<ffffffffa0168855>] ?
md_thread+0xf1/0x10f [md_mod]
Jul  8 15:03:22 ecs-1u kernel: [ 9116.964012]  [<ffffffff81064f0a>] ?
autoremove_wake_function+0x0/0x2e
Jul  8 15:03:22 ecs-1u kernel: [ 9116.964014]  [<ffffffffa0168764>] ?
md_thread+0x0/0x10f [md_mod]
Jul  8 15:03:22 ecs-1u kernel: [ 9116.964016]  [<ffffffff81064c3d>] ?
kthread+0x79/0x81
Jul  8 15:03:22 ecs-1u kernel: [ 9116.964018]  [<ffffffff81011baa>] ?
child_rip+0xa/0x20
Jul  8 15:03:22 ecs-1u kernel: [ 9116.964019]  [<ffffffff81064bc4>] ?
kthread+0x0/0x81
Jul  8 15:03:22 ecs-1u kernel: [ 9116.964021]  [<ffffffff81011ba0>] ?
child_rip+0x0/0x20
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885452] INFO: task kthreadd:2 blocked for more than 120 seconds.
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885477] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885515] kthreadd      D 0000000000000000     0     2      0 0x00000000
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885517]  ffff8801bf13aa60 0000000000000046 0000000000000000 ffff8801bf11d000
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885519]  0000000000000400 0000000000003737 000000000000f9e0 ffff8801bf067fd8
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885521]  0000000000015780 0000000000015780 ffff88033f028710 ffff88033f028a08
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885523] Call Trace:
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885527]  [<ffffffff810b41ed>] ?
sync_page+0x0/0x46
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885529]  [<ffffffff812fb0d2>] ?
io_schedule+0x73/0xb7
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885531]  [<ffffffff810b422e>] ?
sync_page+0x41/0x46
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885533]  [<ffffffff812fb5df>] ?
__wait_on_bit+0x41/0x70
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885535]  [<ffffffff810b43b2>] ?
wait_on_page_bit+0x6b/0x71
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885537]  [<ffffffff81064f38>] ?
wake_bit_function+0x0/0x23
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885539]  [<ffffffff810be14a>] ?
shrink_page_list+0x14e/0x623
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885542]  [<ffffffff8105a8e1>] ?
del_timer_sync+0xc/0x16
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885544]  [<ffffffff8101657d>] ?
read_tsc+0xa/0x20
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885545]  [<ffffffff812fb434>] ?
schedule_timeout+0xad/0xdd
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885547]  [<ffffffff8106c477>] ?
ktime_get_ts+0x68/0xb2
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885549]  [<ffffffff81099d36>] ?
delayacct_end+0x74/0x7f
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885551]  [<ffffffff810bd53b>] ?
isolate_pages_global+0x1a0/0x20f
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885553]  [<ffffffff81065009>] ?
finish_wait+0x35/0x60
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885554]  [<ffffffff81064f0a>] ?
autoremove_wake_function+0x0/0x2e
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885556]  [<ffffffff810bee20>] ?
shrink_list+0x528/0x767
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885559]  [<ffffffff810bf2df>] ?
shrink_zone+0x280/0x342
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885561]  [<ffffffff810c76e8>] ?
zone_statistics+0x3c/0x5d
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885563]  [<ffffffff810b8593>] ?
zone_watermark_ok+0x20/0xb1
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885565]  [<ffffffff810bf76a>] ?
zone_reclaim+0x276/0x357
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885567]  [<ffffffff810bd39b>] ?
isolate_pages_global+0x0/0x20f
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885568]  [<ffffffff810b8593>] ?
zone_watermark_ok+0x20/0xb1
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885570]  [<ffffffff810b98bc>] ?
get_page_from_freelist+0x1ff/0x760
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885573]  [<ffffffff810ba184>] ?
__alloc_pages_nodemask+0x11c/0x5f4
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885575]  [<ffffffff8118e316>] ?
cpumask_next_and+0x2a/0x3a
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885577]  [<ffffffff810453c3>] ?
find_busiest_group+0x9ae/0xa1e
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885579]  [<ffffffff81062afe>] ?
alloc_pid+0x26e/0x390
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885581]  [<ffffffff810b95c0>] ?
__get_free_pages+0x9/0x46
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885583]  [<ffffffff8104c506>] ?
copy_process+0xd7/0x115f
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885585]  [<ffffffff8104d6e5>] ?
do_fork+0x157/0x31e
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885587]  [<ffffffff81048261>] ?
finish_task_switch+0x3a/0xaf
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885589]  [<ffffffff81011b42>] ?
kernel_thread+0x82/0xe0
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885590]  [<ffffffff81064bc4>] ?
kthread+0x0/0x81
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885592]  [<ffffffff81011ba0>] ?
child_rip+0x0/0x20
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885594]  [<ffffffff81064b89>] ?
kthreadd+0xb1/0xec
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885596]  [<ffffffff814f5140>] ?
early_idt_handler+0x0/0x71
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885598]  [<ffffffff81011baa>] ?
child_rip+0xa/0x20
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885600]  [<ffffffff814f5140>] ?
early_idt_handler+0x0/0x71
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885602]  [<ffffffff810e01b1>] ?
do_set_mempolicy+0x128/0x13a
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885603]  [<ffffffff81064ad8>] ?
kthreadd+0x0/0xec
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885605]  [<ffffffff81011ba0>] ?
child_rip+0x0/0x20
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885616] INFO: task
md126_raid10:3493 blocked for more than 120 seconds.
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885641] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885678] md126_raid10  D
0000000000000000     0  3493      2 0x00000000
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885681]  ffff88033f02b880
0000000000000046 0000000000000000 0000000a00000006
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885683]  0000006cffffffff
ffff880006e0fa98 000000000000f9e0 ffff88033df07fd8
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885685]  0000000000015780 0000000000015780
ffff88033e79aa60 ffff88033e79ad58
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885687] Call Trace:
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885689]  [<ffffffff811951d6>] ?
sprintf+0x51/0x59
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885691]  [<ffffffff810414f5>] ?
select_task_rq_fair+0x472/0x836
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885692]  [<ffffffff812fb3b5>] ?
schedule_timeout+0x2e/0xdd
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885694]  [<ffffffff812fb26c>] ?
wait_for_common+0xde/0x15b
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885696]  [<ffffffff8104a440>] ?
default_wake_function+0x0/0x9
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885699]  [<ffffffff81064d7a>] ?
kthread_create+0x93/0x121
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885702]  [<ffffffffa0168764>] ?
md_thread+0x0/0x10f [md_mod]
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885705]  [<ffffffff810e7fb9>] ?
__kmalloc+0x12f/0x141
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885708]  [<ffffffffa01686ba>] ?
md_register_thread+0x22/0xcc [md_mod]
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885711]  [<ffffffffa0167510>] ?
md_do_sync+0x0/0xaf6 [md_mod]
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885714]  [<ffffffffa016872e>] ?
md_register_thread+0x96/0xcc [md_mod]
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885716]  [<ffffffffa016aee2>] ?
md_check_recovery+0x3fd/0x4b9 [md_mod]
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885719]  [<ffffffffa018116c>] ?
flush_pending_writes+0x13/0x8a [raid10]
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885721]  [<ffffffffa0181397>] ?
raid10d+0x42/0xade [raid10]
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885723]  [<ffffffff812faff8>] ?
thread_return+0x79/0xe0
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885725]  [<ffffffff8101166e>] ?
apic_timer_interrupt+0xe/0x20
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885727]  [<ffffffff812fb055>] ?
thread_return+0xd6/0xe0
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885728]  [<ffffffff812fb3b5>] ?
schedule_timeout+0x2e/0xdd
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885731]  [<ffffffffa0168855>] ?
md_thread+0xf1/0x10f [md_mod]
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885733]  [<ffffffff81064f0a>] ?
autoremove_wake_function+0x0/0x2e
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885736]  [<ffffffffa0168764>] ?
md_thread+0x0/0x10f [md_mod]
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885738]  [<ffffffff81064c3d>] ?
kthread+0x79/0x81
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885739]  [<ffffffff81011baa>] ?
child_rip+0xa/0x20
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885741]  [<ffffffff81064bc4>] ?
kthread+0x0/0x81
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885742]  [<ffffffff81011ba0>] ?
child_rip+0x0/0x20

....

Jul  8 15:07:22 ecs-1u kernel: [ 9356.807402] INFO: task
md126_raid10:3493 blocked for more than 120 seconds.
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807427] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807465] md126_raid10  D
0000000000000000     0  3493      2 0x00000000
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807467]  ffff88033f02b880
0000000000000046 0000000000000000 0000000a00000006
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807469]  0000006cffffffff
ffff880006e0fa98 000000000000f9e0 ffff88033df07fd8
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807471]  0000000000015780 0000000000015780
ffff88033e79aa60 ffff88033e79ad58
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807473] Call Trace:
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807475]  [<ffffffff811951d6>] ?
sprintf+0x51/0x59
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807477]  [<ffffffff810414f5>] ?
select_task_rq_fair+0x472/0x836
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807479]  [<ffffffff812fb3b5>] ?
schedule_timeout+0x2e/0xdd
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807481]  [<ffffffff812fb26c>] ?
wait_for_common+0xde/0x15b
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807483]  [<ffffffff8104a440>] ?
default_wake_function+0x0/0x9
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807485]  [<ffffffff81064d7a>] ?
kthread_create+0x93/0x121
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807488]  [<ffffffffa0168764>] ?
md_thread+0x0/0x10f [md_mod]
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807491]  [<ffffffff810e7fb9>] ?
__kmalloc+0x12f/0x141
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807494]  [<ffffffffa01686ba>] ?
md_register_thread+0x22/0xcc [md_mod]
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807497]  [<ffffffffa0167510>] ?
md_do_sync+0x0/0xaf6 [md_mod]
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807500]  [<ffffffffa016872e>] ?
md_register_thread+0x96/0xcc [md_mod]
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807503]  [<ffffffffa016aee2>] ?
md_check_recovery+0x3fd/0x4b9 [md_mod]
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807506]  [<ffffffffa018116c>] ?
flush_pending_writes+0x13/0x8a [raid10]
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807508]  [<ffffffffa0181397>] ?
raid10d+0x42/0xade [raid10]
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807510]  [<ffffffff812faff8>] ?
thread_return+0x79/0xe0
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807511]  [<ffffffff8101166e>] ?
apic_timer_interrupt+0xe/0x20
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807513]  [<ffffffff812fb055>] ?
thread_return+0xd6/0xe0
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807515]  [<ffffffff812fb3b5>] ?
schedule_timeout+0x2e/0xdd
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807518]  [<ffffffffa0168855>] ?
md_thread+0xf1/0x10f [md_mod]
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807520]  [<ffffffff81064f0a>] ?
autoremove_wake_function+0x0/0x2e
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807522]  [<ffffffffa0168764>] ?
md_thread+0x0/0x10f [md_mod]
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807524]  [<ffffffff81064c3d>] ?
kthread+0x79/0x81
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807526]  [<ffffffff81011baa>] ?
child_rip+0xa/0x20
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807527]  [<ffffffff81064bc4>] ?
kthread+0x0/0x81
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807529]  [<ffffffff81011ba0>] ?
child_rip+0x0/0x20
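
The hung-task notices above are easy to miss among the call traces. The following is a minimal Python sketch, not part of the original report, that pulls them out of a saved log even when the mail client has wrapped the line after "INFO: task". The function name find_hung_tasks and the sample text are illustrative assumptions.

```python
import re

# \s+ also matches a newline, so the notice is found even when it was
# wrapped as "INFO: task\nmd126_raid10:3493 blocked ...".
HUNG_TASK = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+INFO:\s+task\s+"
    r"(?P<name>\S+):(?P<pid>\d+)\s+blocked\s+for\s+more\s+than\s+"
    r"(?P<secs>\d+)\s+seconds"
)


def find_hung_tasks(log_text):
    """Return (uptime_seconds, task_name, pid, threshold_seconds) tuples."""
    return [
        (float(m["uptime"]), m["name"], int(m["pid"]), int(m["secs"]))
        for m in HUNG_TASK.finditer(log_text)
    ]


# Two notices copied from the log above, including the line wrap.
sample = """\
Jul  8 15:05:22 ecs-1u kernel: [ 9236.885616] INFO: task
md126_raid10:3493 blocked for more than 120 seconds.
Jul  8 15:07:22 ecs-1u kernel: [ 9356.807402] INFO: task
md126_raid10:3493 blocked for more than 120 seconds.
"""

reports = find_hung_tasks(sample)
for uptime, name, pid, secs in reports:
    print(f"{uptime:.6f}s: {name} (pid {pid}) blocked > {secs}s")
```

Run over the full log, this would show whether the same md thread is reported repeatedly (as here) or whether other tasks are stalling as well.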




* Re: Western Digital Scorpio and ICH10R on Debian - NCQ issue?
  2011-07-12 16:21 Western Digital Scorpio and ICH10R on Debian - NCQ issue? Sandra Escandor
@ 2011-07-16  1:16 ` Robert Hancock
  2011-07-18 12:42   ` Sandra Escandor
  0 siblings, 1 reply; 6+ messages in thread
From: Robert Hancock @ 2011-07-16  1:16 UTC (permalink / raw)
  To: Sandra Escandor; +Cc: linux-ide

On 07/12/2011 10:21 AM, Sandra Escandor wrote:
> The Situation:
> It appears that a WRITE FPDMA QUEUED failed command causes driver
> timeouts - this in turn locks up the RAID (which once worked pretty
> well). This occurred during high I/O.
>
> The question:
> 1. Is it a good idea to turn off NCQ? I've read in different posts that
> it helps some, but not others - I'm currently on the way to getting an
> experimental box setup, but I wanted to confirm if this was a good idea.

Turning off NCQ is not really a solution to anything, at least not likely
in this case. It is more of a workaround that might happen to work by chance.

> 2. Are there known issues with the ICH10R + WD7500BPKT-00PK4T0 and the
> libata driver?

Nothing known, no.

>
> The System:
> Four WDC WD7500BPKT-00PK4T0 drives (Western Digital Scorpio) - in RAID10
> array created using mdadm 3.1.4
> ICH10R sata controller.
> Kernel 2.6.32-5-amd64

The fact that you have multiple drives and the problem tends to occur
during heavy I/O may point to a power issue. Timeouts like these have been
known to happen when some of the drives don't get enough power during
spikes in power draw under I/O load. In that case, using a beefier power
supply or spreading the drives out across different cables from the PSU
may help.
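
For anyone who still wants to try the NCQ workaround on a test box: the kernel exposes the per-device queue depth in sysfs at /sys/block/<dev>/device/queue_depth, and setting it to 1 effectively turns queuing off for that drive. The following is a minimal Python sketch, assuming root privileges on the real system. The helper names and the sysfs_root parameter (present only so the functions can be pointed at a scratch directory) are illustrative and not from this thread.

```python
from pathlib import Path


def queue_depth_path(device, sysfs_root="/sys"):
    """Path of the queue depth attribute for a block device such as 'sdc'."""
    return Path(sysfs_root) / "block" / device / "device" / "queue_depth"


def read_queue_depth(device, sysfs_root="/sys"):
    """Return the current queue depth as an integer."""
    return int(queue_depth_path(device, sysfs_root).read_text().strip())


def set_queue_depth(device, depth, sysfs_root="/sys"):
    """Write a new queue depth; a depth of 1 leaves no room for queuing.

    Needs root on a real system and does not persist across reboots.
    """
    if depth < 1:
        raise ValueError("queue depth must be at least 1")
    queue_depth_path(device, sysfs_root).write_text(f"{depth}\n")
```

For example, set_queue_depth("sdc", 1) would target the drive that timed out in the logs, and read_queue_depth("sdc") would confirm the change. As noted above, this is a workaround to experiment with, not a fix for an underlying power problem.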


* RE: Western Digital Scorpio and ICH10R on Debian - NCQ issue?
  2011-07-16  1:16 ` Robert Hancock
@ 2011-07-18 12:42   ` Sandra Escandor
  2011-07-18 16:41     ` Robert Hancock
  0 siblings, 1 reply; 6+ messages in thread
From: Sandra Escandor @ 2011-07-18 12:42 UTC (permalink / raw)
  To: Robert Hancock; +Cc: linux-ide

Thanks for the insight, Robert. Do you (or anyone else on the list) know
of any utilities that would let me observe (and log) the power
consumption of the drives during high I/O?



* Re: Western Digital Scorpio and ICH10R on Debian - NCQ issue?
  2011-07-18 12:42   ` Sandra Escandor
@ 2011-07-18 16:41     ` Robert Hancock
  2011-07-19 13:20       ` Sandra Escandor
  0 siblings, 1 reply; 6+ messages in thread
From: Robert Hancock @ 2011-07-18 16:41 UTC (permalink / raw)
  To: Sandra Escandor; +Cc: linux-ide

On Mon, Jul 18, 2011 at 6:42 AM, Sandra Escandor <sescandor@evertz.com> wrote:
> Thanks for the insight, Robert. Do you (or anyone else on the list) know
> of any utilities that would let me observe (and log) the power
> consumption of the drives during high I/O?

I don't think there's anything that you could do to measure this in
software. A clamp-on ammeter on one of the power supply wires would
give you a measurement, but it might not catch brief current spikes
that could be causing problems.

Usually these kinds of problems get fixed by trial and error (swapping
drives between cables, a different PSU).


* RE: Western Digital Scorpio and ICH10R on Debian - NCQ issue?
  2011-07-18 16:41     ` Robert Hancock
@ 2011-07-19 13:20       ` Sandra Escandor
  2011-07-19 14:46         ` Robert Hancock
  0 siblings, 1 reply; 6+ messages in thread
From: Sandra Escandor @ 2011-07-19 13:20 UTC (permalink / raw)
  To: Robert Hancock; +Cc: linux-ide

I was just reading over the kernel logs that I sent again, and I am
wondering if this might be a software issue instead, since the kernel
log shows that the drive that seems to time out is supposedly disabled
after disk failure (sdc was disabled by raid10 module, I think):

Jul  8 14:57:19 ecs-1u kernel: [ 8753.699104] sd 2:0:0:0: [sdc]
Unhandled error code
Jul  8 14:57:19 ecs-1u kernel: [ 8753.699107] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul  8 14:57:19 ecs-1u kernel: [ 8753.699110] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 18 00 00 04 00 00
Jul  8 14:57:19 ecs-1u kernel: [ 8753.699117] end_request: I/O error,
dev sdc, sector 1053759488
Jul  8 14:57:19 ecs-1u kernel: [ 8753.699144] raid10: Disk failure on
sdc, disabling device.
Jul  8 14:57:19 ecs-1u kernel: [ 8753.699144] raid10: Operation
continuing on 3 devices.

But then, about a minute later (14:58:17), there is another unhandled
error code from sdc. Shouldn't these have stopped, since the device was
supposedly disabled?

Jul  8 14:58:17 ecs-1u kernel: [ 8812.088705] sd 2:0:0:0: [sdc]
Unhandled error code
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088710] sd 2:0:0:0: [sdc] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088714] sd 2:0:0:0: [sdc] CDB:
Write(10): 2a 00 3e cf 63 00 00 04 00 00
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088723] end_request: I/O error,
dev sdc, sector 1053778688

Is the [sdc] output coming from libata still?

Thanks for your help on this, I feel like I've been stuck for a bit :)
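
As a cross-check on the two log excerpts above, the Write(10) CDB bytes can be decoded by hand: bytes 2-5 hold the big-endian logical block address and bytes 7-8 the transfer length in blocks. The following is a minimal Python sketch. The function name decode_write10 is illustrative, and the 512-byte block size used for the byte count is an assumption about these drives.

```python
def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


# The two CDBs from the sdc errors quoted above.
for cdb in ("2a 00 3e cf 18 00 00 04 00 00", "2a 00 3e cf 63 00 00 04 00 00"):
    lba, blocks = decode_write10(cdb)
    # 512 bytes per block is assumed here, not taken from the logs.
    print(f"LBA {lba}, {blocks} blocks, {blocks * 512} bytes")
```

The decoded addresses are 1053759488 and 1053778688, which match the sector numbers in the corresponding end_request lines, so both errors are ordinary 1024-block writes to different locations on sdc.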


* Re: Western Digital Scorpio and ICH10R on Debian - NCQ issue?
  2011-07-19 13:20       ` Sandra Escandor
@ 2011-07-19 14:46         ` Robert Hancock
  0 siblings, 0 replies; 6+ messages in thread
From: Robert Hancock @ 2011-07-19 14:46 UTC (permalink / raw)
  To: Sandra Escandor; +Cc: linux-ide

On Tue, Jul 19, 2011 at 7:20 AM, Sandra Escandor <sescandor@evertz.com> wrote:
> I was just reading over the kernel logs that I sent again, and I am
> wondering if this might be a software issue instead, since the kernel
> log shows that the drive that seems to time out is supposedly disabled
> after disk failure (sdc was disabled by raid10 module, I think):
>
> Jul  8 14:57:19 ecs-1u kernel: [ 8753.699104] sd 2:0:0:0: [sdc]
> Unhandled error code
> Jul  8 14:57:19 ecs-1u kernel: [ 8753.699107] sd 2:0:0:0: [sdc] Result:
> hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
> Jul  8 14:57:19 ecs-1u kernel: [ 8753.699110] sd 2:0:0:0: [sdc] CDB:
> Write(10): 2a 00 3e cf 18 00 00 04 00 00
> Jul  8 14:57:19 ecs-1u kernel: [ 8753.699117] end_request: I/O error,
> dev sdc, sector 1053759488
> Jul  8 14:57:19 ecs-1u kernel: [ 8753.699144] raid10: Disk failure on
> sdc, disabling device.
> Jul  8 14:57:19 ecs-1u kernel: [ 8753.699144] raid10: Operation
> continuing on 3 devices.
>
> But then, about a minute later (14:58:17), there is another unhandled
> error code from sdc. Shouldn't these have stopped, since the device was
> supposedly disabled?

The RAID layer will "disable" the device after it gets an I/O request
failure. However, the SCSI or libata layers may still be doing error
handling in the background; the RAID layer simply doesn't wait for that
to finish.
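
To see how long the lower layers keep reporting errors after md has given up on a disk, the kernel uptime stamps can be compared directly. The following is a minimal Python sketch. The function name errors_after_disable is illustrative, and the sample is an excerpt of the log lines quoted in this thread with the sd lines omitted.

```python
import re

# \s+ tolerates the line wraps introduced by the mail client.
DISABLED = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+raid10:\s+Disk\s+failure\s+on\s+(\w+),"
    r"\s+disabling\s+device"
)
IO_ERROR = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+end_request:\s+I/O\s+error,\s+dev\s+(\w+),"
    r"\s+sector\s+(\d+)"
)


def errors_after_disable(log_text):
    """Yield (device, seconds_after_disable, sector) for late I/O errors."""
    # If a device is disabled more than once, the last notice wins.
    disabled_at = {dev: float(ts) for ts, dev in DISABLED.findall(log_text)}
    for ts, dev, sector in IO_ERROR.findall(log_text):
        if dev in disabled_at and float(ts) > disabled_at[dev]:
            yield dev, float(ts) - disabled_at[dev], int(sector)


sample = """\
Jul  8 14:57:19 ecs-1u kernel: [ 8753.699117] end_request: I/O error,
dev sdc, sector 1053759488
Jul  8 14:57:19 ecs-1u kernel: [ 8753.699144] raid10: Disk failure on
sdc, disabling device.
Jul  8 14:58:17 ecs-1u kernel: [ 8812.088723] end_request: I/O error,
dev sdc, sector 1053778688
"""

late = list(errors_after_disable(sample))
for dev, delay, sector in late:
    print(f"{dev}: I/O error {delay:.1f}s after disable, sector {sector}")
```

On this excerpt only the second error counts as late, since the first is the failure that triggered the disable. It arrives roughly 58 seconds afterwards, which fits the picture of already-queued commands timing out in the background.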

>
> Jul  8 14:58:17 ecs-1u kernel: [ 8812.088705] sd 2:0:0:0: [sdc]
> Unhandled error code
> Jul  8 14:58:17 ecs-1u kernel: [ 8812.088710] sd 2:0:0:0: [sdc] Result:
> hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
> Jul  8 14:58:17 ecs-1u kernel: [ 8812.088714] sd 2:0:0:0: [sdc] CDB:
> Write(10): 2a 00 3e cf 63 00 00 04 00 00
> Jul  8 14:58:17 ecs-1u kernel: [ 8812.088723] end_request: I/O error,
> dev sdc, sector 1053778688
>
> Is the [sdc] output coming from libata still?
>
> Thanks for your help on this, I feel like I've been stuck for a bit :)

