From: Jim Klimov <klimov@2ka.mipt.ru>
To: Jens Axboe <jens.axboe@oracle.com>
Cc: Jim Klimov <klimov@2ka.mipt.ru>, Linux RAID <linux-raid@vger.kernel.org>
Subject: Re[4]: RAID1 submirror failure causes reboot?
Date: Tue, 14 Nov 2006 13:36:12 +0300 [thread overview]
Message-ID: <77664873.20061114133612@2ka.mipt.ru> (raw)
In-Reply-To: <20061114072857.GI15031@kernel.dk>
Hello Jens,
JA> Then lets wait for Jim to repeat his testing with all the debugging
JA> options enabled, that should make us a little wiser.
Ok, I'll enable the kernel rebuilt with these options and report any
findings later. So far, I'll move to the other questions aroused.
I remember when I ran with most of debugging ON last week, it made the
server run darn slow, with LA around 40-60 and little responsiveness,
and numerous messages like:
Hangcheck: hangcheck value past margin!
BUG: soft lockup detected on CPU#1!
which hide the interesting traces :)
However, I can post these captured traces of several system lifetimes
to the list or privately.
Concerning other questions:
1) The workload on the software raid is rather small. It's a set of
system partitions which keep fileserver's logs, etc. The file
storage is on 3Ware cards and has substantial load. MD arrays are
checked nightly, though (echo check > sync_action), and most often
this triggers the problem. These drives only contain mirrored
partitions, so there should be no I/O to these drives around
MD, except for rare cases of lilo running :)
2) I have installed a third submirror disk this weekend, it's an IDE
slave device hdd (near the failing hdc). Since then I got errors
on other partitions, attached below as "2*)".
3) The failures which lead to reboots are usually preceded by a long
history of dma_intr errors 0x40 and 0x51, but that sample I sent
was rather full. A few errors preceded it every 5 seconds, making
the full trace like this:
[87319.049902] hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[87319.057393] hdc: dma_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315631
[87319.067205] ide: failed opcode was: unknown
[87323.956399] hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[87323.963681] hdc: dma_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315631
[87323.973171] ide: failed opcode was: unknown
[87328.846265] hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[87328.853485] hdc: dma_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315631
[87328.862834] ide: failed opcode was: unknown
[87333.736127] hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[87333.743535] hdc: dma_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315631
[87333.752876] ide: failed opcode was: unknown
[87333.806569] ide1: reset: success
[87338.675891] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87338.685143] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315711
[87338.694791] ide: failed opcode was: unknown
[87343.557424] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87343.566388] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315711
[87343.576105] ide: failed opcode was: unknown
[87348.472226] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87348.481170] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315711
[87348.490843] ide: failed opcode was: unknown
[87353.387028] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87353.395735] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315711
[87353.405500] ide: failed opcode was: unknown
[87353.461342] ide1: reset: success
[87358.326783] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87358.335739] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
[87358.345395] ide: failed opcode was: unknown
[87363.208313] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87363.217319] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
[87363.228371] ide: failed opcode was: unknown
[87368.106472] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87368.115414] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
[87368.125275] ide: failed opcode was: unknown
[87372.979686] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87372.988706] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
[87372.998849] ide: failed opcode was: unknown
[87373.052152] ide1: reset: success
[87377.927744] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87377.936682] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
[87377.946399] ide: failed opcode was: unknown
[87382.800953] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87382.809881] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
[87382.819511] ide: failed opcode was: unknown
[87387.682479] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87387.691473] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
[87387.701287] ide: failed opcode was: unknown
[87392.564004] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87392.572790] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
[87392.582454] ide: failed opcode was: unknown
[87392.635961] ide1: reset: success
[87397.528687] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87397.537607] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
[87397.547335] ide: failed opcode was: unknown
[87397.551897] end_request: I/O error, dev hdc, sector 176315718
[87398.520820] raid1: Disk failure on hdc11, disabling device.
[87398.520826] Operation continuing on 1 devices
[87398.531579] blk: request botched
[87398.535098] hdc: task_out_intr: status=0x50 { DriveReady SeekComplete }
[87398.542129] ide: failed opcode was: unknown
[87403.582775] ------------[ cut here ]------------
[87403.587748] kernel BUG at mm/filemap.c:541!
[87403.592082] invalid opcode: 0000 [#1]
[87403.596063] SMP
[87403.598217] Modules linked in: w83781d hwmon_vid i2c_isa i2c_core w83627hf_wdt
[87403.606114] CPU: 0
[87403.606117] EIP: 0060:[<c01406a7>] Not tainted VLI
[87403.606120] EFLAGS: 00010046 (2.6.18.2debug #1)
[87403.619728] EIP is at unlock_page+0x12/0x2d
[87403.624170] eax: 00000000 ebx: c2d5caa8 ecx: e8148680 edx: c2d5caa8
[87403.631543] esi: da71c600 edi: 00000001 ebp: c04cfe28 esp: c04cfe24
[87403.638924] ds: 007b es: 007b ss: 0068
[87403.643419] Process swapper (pid: 0, ti=c04ce000 task=c041e500 task.ti=c04ce000)
[87403.650774] Stack: e81487e8 c04cfe3c c0180e0a da71c600 00000000 c0180dac c04cfe64 c0164af9
[87403.659985] f7d49000 c04cfe84 f2dea5a0 f2dea5a0 00000000 da71c600 00000000 da71c600
[87403.669288] c04cfea8 c0256778 c041e500 00000000 c04cbd90 00000046 00000000 00000000
[87403.678603] Call Trace:
[87403.681462] [<c0103bba>] show_stack_log_lvl+0x8d/0xaa
[87403.686911] [<c0103ddc>] show_registers+0x1b0/0x221
[87403.692306] [<c0103ffc>] die+0x124/0x1ee
[87403.696558] [<c0104165>] do_trap+0x9f/0xa1
[87403.700988] [<c0104427>] do_invalid_op+0xa7/0xb1
[87403.706012] [<c0103871>] error_code+0x39/0x40
[87403.710794] [<c0180e0a>] mpage_end_io_read+0x5e/0x72
[87403.716154] [<c0164af9>] bio_endio+0x56/0x7b
[87403.720798] [<c0256778>] __end_that_request_first+0x1e0/0x301
[87403.726985] [<c02568a4>] end_that_request_first+0xb/0xd
[87403.732699] [<c02bd73c>] __ide_end_request+0x54/0xe1
[87403.738214] [<c02bd807>] ide_end_request+0x3e/0x5c
[87403.743382] [<c02c35df>] task_error+0x5b/0x97
[87403.748113] [<c02c36fa>] task_in_intr+0x6e/0xa2
[87403.753120] [<c02bf19e>] ide_intr+0xaf/0x12c
[87403.757815] [<c013e5a7>] handle_IRQ_event+0x23/0x57
[87403.763135] [<c013e66f>] __do_IRQ+0x94/0xfd
[87403.767802] [<c0105192>] do_IRQ+0x32/0x68
[87403.772278] [<c010372e>] common_interrupt+0x1a/0x20
[87403.777586] [<c0100cfe>] cpu_idle+0x7d/0x86
[87403.782184] [<c01002b7>] rest_init+0x23/0x25
[87403.786869] [<c04d4889>] start_kernel+0x175/0x19d
[87403.791963] [<00000000>] 0x0
[87403.795270] Code: ff ff ff b9 0b 00 14 c0 8d f0 <0>Kernel panic - not syncing: Fatal exception in interrupt
smp_call_function+0x12b/0x130
[87403.935200] [<c013e66f>]
2*) The new problem log:
... rebuilding new spare partitions into a 3-way array went well...
[114683.721498] --- wd:3 rd:3
[114683.721508] disk 0, wo:0, o:1, dev:hdc10
[114683.721541] disk 1, wo:0, o:1, dev:hda10
[114683.721556] disk 2, wo:0, o:1, dev:hdd10
[115150.208060] RAID1 conf printout:
[115150.211893] --- wd:3 rd:3
[115150.214923] disk 0, wo:0, o:1, dev:hdc8
[115150.219269] disk 1, wo:0, o:1, dev:hda8
[115150.224022] disk 2, wo:0, o:1, dev:hdd8
[115296.149849] RAID1 conf printout:
[115296.149868] --- wd:3 rd:3
[115296.149895] disk 0, wo:0, o:1, dev:hdc7
[115296.149915] disk 1, wo:0, o:1, dev:hda7
[115296.149931] disk 2, wo:0, o:1, dev:hdd7
[115460.641824] hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[115460.649294] hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=28433205, sector=28433023
[115460.658774] ide: failed opcode was: unknown
[115460.663287] end_request: I/O error, dev hdc, sector 28433023
[115465.938995] hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[115465.946451] hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=28433201, sector=28433031
[115465.957536] ide: failed opcode was: unknown
[115465.962041] end_request: I/O error, dev hdc, sector 28433031
[115488.743391] hdc: dma_timer_expiry: dma status == 0x60
[115488.748714] hdc: DMA timeout retry
[115488.752450] hdc: timeout waiting for DMA
[115488.979377] hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
[115488.987829] ide: failed opcode was: unknown
[115488.992469] hdc: drive not ready for command
[115518.991827] hdc: lost interrupt
[115518.995201] hdc: task_out_intr: status=0x58 { DriveReady SeekComplete DataRequest }
[115519.003493] ide: failed opcode was: unknown
[115524.009004] hdc: status timeout: status=0xd0 { Busy }
[115524.014441] ide: failed opcode was: unknown
[115524.018988] hdc: drive not ready for command
[115524.114548] ide1: reset: success
[115524.198619] raid1: Disk failure on hdc6, disabling device.
[115524.198624] Operation continuing on 2 devices
[115524.209260] hdc: task_out_intr: status=0x50 { DriveReady SeekComplete }
[115544.209530] hdc: dma_timer_expiry: dma status == 0x20
[115544.215027] hdc: DMA timeout retry
[115544.218669] hdc: timeout waiting for DMA
[115544.461490] hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
[115544.469875] ide: failed opcode was: unknown
[115544.474388] hdc: drive not ready for command
[115544.503556] RAID1 conf printout:
[115544.507091] --- wd:1 rd:2
[115544.510120] disk 0, wo:0, o:1, dev:hdd1
[115544.712573] ------------[ cut here ]------------
[115544.717459] kernel BUG at mm/filemap.c:541!
[115544.721902] invalid opcode: 0000 [#1]
[115544.725806] SMP
[115544.728015] Modules linked in: w83781d hwmon_vid i2c_isa i2c_core w83627hf_wdt
[115544.735981] CPU: 0
[115544.735983] EIP: 0060:[<c013c436>] Not tainted VLI
[115544.735986] EFLAGS: 00010046 (2.6.18.2server #4)
[115544.749248] EIP is at unlock_page+0xf/0x28
[115544.753589] eax: 00000000 ebx: c1f5d8c0 ecx: eab25680 edx: c1f5d8c0
[115544.760627] esi: f6a78f00 edi: 00000001 ebp: 00000000 esp: c0481e7c
[115544.767703] ds: 007b es: 007b ss: 0068
[115544.772033] Process swapper (pid: 0, ti=c0480000 task=c040e460 task.ti=c0480000)
[115544.779512] Stack: eab25674 c017bed6 f6a78f00 00000000 c017be7c c016069e 00000003 c2e2aa20
[115544.788824] c0115cf1 00000020 f6a78f00 00003400 f6a78f00 00000000 c024f8ac f7f37a00
[115544.798171] f7f37a00 00000001 c04fcd94 00000001 00000000 00000000 00000000 00002c00
[115544.807683] Call Trace:
[115544.810797] [<c017bed6>] mpage_end_io_read+0x5a/0x6e
[115544.816407] [<c017be7c>] mpage_end_io_read+0x0/0x6e
[115544.821865] [<c016069e>] bio_endio+0x5f/0x84
[115544.826749] [<c0115cf1>] find_busiest_group+0x153/0x4df
[115544.832582] [<c024f8ac>] __end_that_request_first+0x1ec/0x31c
[115544.838889] [<c02b508a>] __ide_end_request+0x56/0xe4
[115544.844462] [<c02b5159>] ide_end_request+0x41/0x65
[115544.849671] [<c02bad69>] task_end_request+0x37/0x7e
[115544.854950] [<c02baed9>] task_out_intr+0x84/0xb9
[115544.860128] [<c01243bd>] del_timer+0x56/0x58
[115544.864839] [<c02bae55>] task_out_intr+0x0/0xb9
[115544.869760] [<c02b6a4b>] ide_intr+0xb7/0x132
[115544.874467] [<c013a2f4>] handle_IRQ_event+0x26/0x59
[115544.879740] [<c013a3ae>] __do_IRQ+0x87/0xed
[115544.884368] [<c0105199>] do_IRQ+0x31/0x69
[115544.888843] [<c0103746>] common_interrupt+0x1a/0x20
[115544.894124] [<c0100c4f>] default_idle+0x35/0x5b
[115544.899233] [<c0100d03>] cpu_idle+0x7a/0x83
[115544.903844] [<c0486810>] start_kernel+0x16a/0x194
[115544.908950] [<c0486286>] unknown_bootoption+0x0/0x1b6
[115544.914384] Code: e8 5e ff ff ff b9 __wake_up+0x38/0x4e
[115545.083930] [<c01243bd>]
--
Best regards,
Jim Klimov mailto:klimov@2ka.mipt.ru
prev parent reply other threads:[~2006-11-14 10:36 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2006-11-10 8:17 RAID1 submirror failure causes reboot? Jim Klimov
2006-11-10 8:41 ` Neil Brown
2006-11-10 12:53 ` Re[2]: " Jim Klimov
2006-11-13 7:17 ` Neil Brown
2006-11-13 20:11 ` Jens Axboe
2006-11-13 22:05 ` Neil Brown
2006-11-14 7:28 ` Jens Axboe
2006-11-14 10:36 ` Jim Klimov [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=77664873.20061114133612@2ka.mipt.ru \
--to=klimov@2ka.mipt.ru \
--cc=jens.axboe@oracle.com \
--cc=linux-raid@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).