* RAID1 submirror failure causes reboot?
From: Jim Klimov @ 2006-11-10 8:17 UTC
To: Linux RAID

Hello Linux RAID,

One of our servers using per-partition mirroring has a
frequently-failing partition, hdc11 below.

When it is deemed failing, the server usually crashes
with a stack trace like the one below. This seems strange, because
the other submirror, hda11, is alive and well, and this
should all be transparent through the RAID layer; that is
what it is for, isn't it?

After the reboot I usually succeed in hot-adding hdc11
back to the mirror, although several times it was not
marked dead at all and rebuilt by itself after the reboot.
That also seems incorrect: if it died, it should be
marked as such (perhaps in the metadata on the live mirror)?

Overall, uncool (although mirroring has saved us many
times, thanks!)

Nov 10 03:56:51 video kernel: [84443.270516] md: syncing RAID array md11
Nov 10 03:56:52 video kernel: [84443.270532] md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Nov 10 03:56:54 video kernel: [84443.270544] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
Nov 10 03:56:55 video kernel: [84443.270565] md: using 128k window, over a total of 65430144 blocks.
Nov 10 03:56:56 video kernel: [84443.271478] RAID1 conf printout:
Nov 10 03:57:01 video kernel: [84443.275446] --- wd:2 rd:2
Nov 10 03:57:10 video kernel: [84443.278773] disk 0, wo:0, o:1, dev:hdc10
Nov 10 03:57:11 video kernel: [84443.283272] disk 1, wo:0, o:1, dev:hda10
[87319.049902] hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[87319.057393] hdc: dma_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315631
[87319.067205] ide: failed opcode was: unknown
[87323.956399] hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[87323.963681] hdc: dma_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315631
[87323.973171] ide: failed opcode was: unknown
[87328.846265] hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[87328.853485] hdc: dma_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315631
[87328.862834] ide: failed opcode was: unknown
[87333.736127] hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[87333.743535] hdc: dma_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315631
[87333.752876] ide: failed opcode was: unknown
[87333.806569] ide1: reset: success
[87338.675891] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87338.685143] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315711
[87338.694791] ide: failed opcode was: unknown
[87343.557424] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87343.566388] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315711
[87343.576105] ide: failed opcode was: unknown
[87348.472226] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87348.481170] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315711
[87348.490843] ide: failed opcode was: unknown
[87353.387028] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87353.395735] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315711
[87353.405500] ide: failed opcode was: unknown
[87353.461342] ide1: reset: success
[87358.326783] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87358.335739] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
[87358.345395] ide: failed opcode was: unknown
[87363.208313] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87363.217319] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
[87363.228371] ide: failed opcode was: unknown
[87368.106472] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87368.115414] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
[87368.125275] ide: failed opcode was: unknown
[87372.979686] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87372.988706] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
[87372.998849] ide: failed opcode was: unknown
[87373.052152] ide1: reset: success
[87377.927744] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87377.936682] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
[87377.946399] ide: failed opcode was: unknown
[87382.800953] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87382.809881] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
[87382.819511] ide: failed opcode was: unknown
[87387.682479] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87387.691473] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
[87387.701287] ide: failed opcode was: unknown
[87392.564004] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87392.572790] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
[87392.582454] ide: failed opcode was: unknown
[87392.635961] ide1: reset: success
[87397.528687] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87397.537607] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
[87397.547335] ide: failed opcode was: unknown
[87397.551897] end_request: I/O error, dev hdc, sector 176315718
[87398.520820] raid1: Disk failure on hdc11, disabling device.
[87398.520826] Operation continuing on 1 devices
[87398.531579] blk: request botched
[87398.535098] hdc: task_out_intr: status=0x50 { DriveReady SeekComplete }
[87398.542129] ide: failed opcode was: unknown
[87403.582775] ------------[ cut here ]------------
[87403.587748] kernel BUG at mm/filemap.c:541!
[87403.592082] invalid opcode: 0000 [#1]
[87403.596063] SMP
[87403.598217] Modules linked in: w83781d hwmon_vid i2c_isa i2c_core w83627hf_wdt
[87403.606114] CPU: 0
[87403.606117] EIP: 0060:[<c01406a7>] Not tainted VLI
[87403.606120] EFLAGS: 00010046 (2.6.18.2debug #1)
[87403.619728] EIP is at unlock_page+0x12/0x2d
[87403.624170] eax: 00000000 ebx: c2d5caa8 ecx: e8148680 edx: c2d5caa8
[87403.631543] esi: da71c600 edi: 00000001 ebp: c04cfe28 esp: c04cfe24
[87403.638924] ds: 007b es: 007b ss: 0068
[87403.643419] Process swapper (pid: 0, ti=c04ce000 task=c041e500 task.ti=c04ce000)
[87403.650774] Stack: e81487e8 c04cfe3c c0180e0a da71c600 00000000 c0180dac c04cfe64 c0164af9
[87403.659985] f7d49000 c04cfe84 f2dea5a0 f2dea5a0 00000000 da71c600 00000000 da71c600
[87403.669288] c04cfea8 c0256778 c041e500 00000000 c04cbd90 00000046 00000000 00000000
[87403.678603] Call Trace:
[87403.681462] [<c0103bba>] show_stack_log_lvl+0x8d/0xaa
[87403.686911] [<c0103ddc>] show_registers+0x1b0/0x221
[87403.692306] [<c0103ffc>] die+0x124/0x1ee
[87403.696558] [<c0104165>] do_trap+0x9f/0xa1
[87403.700988] [<c0104427>] do_invalid_op+0xa7/0xb1
[87403.706012] [<c0103871>] error_code+0x39/0x40
[87403.710794] [<c0180e0a>] mpage_end_io_read+0x5e/0x72
[87403.716154] [<c0164af9>] bio_endio+0x56/0x7b
[87403.720798] [<c0256778>] __end_that_request_first+0x1e0/0x301
[87403.726985] [<c02568a4>] end_that_request_first+0xb/0xd
[87403.732699] [<c02bd73c>] __ide_end_request+0x54/0xe1
[87403.738214] [<c02bd807>] ide_end_request+0x3e/0x5c
[87403.743382] [<c02c35df>] task_error+0x5b/0x97
[87403.748113] [<c02c36fa>] task_in_intr+0x6e/0xa2
[87403.753120] [<c02bf19e>] ide_intr+0xaf/0x12c
[87403.757815] [<c013e5a7>] handle_IRQ_event+0x23/0x57
[87403.763135] [<c013e66f>] __do_IRQ+0x94/0xfd
[87403.767802] [<c0105192>] do_IRQ+0x32/0x68
[87403.772278] [<c010372e>] common_interrupt+0x1a/0x20
[87403.777586] [<c0100cfe>] cpu_idle+0x7d/0x86
[87403.782184] [<c01002b7>] rest_init+0x23/0x25
[87403.786869] [<c04d4889>] start_kernel+0x175/0x19d
[87403.791963] [<00000000>] 0x0
[87403.795270] Code: ff ff ff b9 0b 00 14 c0 8d 55 dc c7 04 24 02 00 00 00 e8 21 26 25 00 eb dc 55 89 e5 53 89 c3 31 c0 f0 0f b3 03 19 c0 85 c0 75 08 <0f> 0b 1d 02 6c bf 3b c0 89 d8 e8 34 ff ff ff 89 da 31 c9 e8 24
[87403.819040] EIP: [<c01406a7>] unlock_page+0x12/0x2d SS:ESP 0068:c04cfe24
[87403.826101] <0>Kernel panic - not syncing: Fatal exception in interrupt
--
Best regards,
Jim Klimov mailto:klimov@2ka.mipt.ru
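
[ The BUG at mm/filemap.c:541 reported above is the sanity check in
unlock_page(): it fires when the page being unlocked was not actually
locked, which is what a duplicated or misdirected read completion would
produce.  Below is a paraphrased sketch of the 2.6.18-era function; the
exact line number and details are approximate, not the verbatim source. ]

/*
 * Paraphrased from 2.6.18-era mm/filemap.c (illustrative, not verbatim).
 * PG_locked must still be set when unlock_page() runs; if the bit was
 * already clear, some path has completed I/O on this page twice (or on
 * the wrong page), and the kernel gives up with BUG(), which is the line
 * the panic above points at.
 */
void fastcall unlock_page(struct page *page)
{
	smp_mb__before_clear_bit();
	if (!TestClearPageLocked(page))		/* returns the old bit value */
		BUG();				/* "kernel BUG at mm/filemap.c:541" */
	smp_mb__after_clear_bit();
	wake_up_page(page, PG_locked);		/* wake wait_on_page_locked() waiters */
}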

* Re: RAID1 submirror failure causes reboot?
From: Neil Brown @ 2006-11-10 8:41 UTC
To: Jim Klimov; +Cc: Linux RAID, jens.axboe

On Friday November 10, klimov@2ka.mipt.ru wrote:
> Hello Linux RAID,
>
> One of our servers using per-partition mirroring has a
> frequently-failing partition, hdc11 below.
>
> When it is deemed failing, the server usually crashes
> with a stack trace like the one below. This seems strange, because
> the other submirror, hda11, is alive and well, and this
> should all be transparent through the RAID layer; that is
> what it is for, isn't it?
>
> After the reboot I usually succeed in hot-adding hdc11
> back to the mirror, although several times it was not
> marked dead at all and rebuilt by itself after the reboot.
> That also seems incorrect: if it died, it should be
> marked as such (perhaps in the metadata on the live mirror)?
>
> Overall, uncool (although mirroring has saved us many
> times, thanks!)
>
> --snip--
> [87392.564004] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
> [87392.572790] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
> [87392.582454] ide: failed opcode was: unknown
> [87392.635961] ide1: reset: success
> [87397.528687] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
> [87397.537607] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
> [87397.547335] ide: failed opcode was: unknown
> [87397.551897] end_request: I/O error, dev hdc, sector 176315718
> [87398.520820] raid1: Disk failure on hdc11, disabling device.
> [87398.520826] Operation continuing on 1 devices
> [87398.531579] blk: request botched
                 ^^^^^^^^^^^^^^^^^^^^

That looks bad.  Possibly some bug in the IDE controller or elsewhere
in the block layer.  Jens: What might cause that?

--snip--

> [87403.678603] Call Trace:
> [87403.681462] [<c0103bba>] show_stack_log_lvl+0x8d/0xaa
> [87403.686911] [<c0103ddc>] show_registers+0x1b0/0x221
> [87403.692306] [<c0103ffc>] die+0x124/0x1ee
> [87403.696558] [<c0104165>] do_trap+0x9f/0xa1
> [87403.700988] [<c0104427>] do_invalid_op+0xa7/0xb1
> [87403.706012] [<c0103871>] error_code+0x39/0x40
> [87403.710794] [<c0180e0a>] mpage_end_io_read+0x5e/0x72
> [87403.716154] [<c0164af9>] bio_endio+0x56/0x7b
> [87403.720798] [<c0256778>] __end_that_request_first+0x1e0/0x301
> [87403.726985] [<c02568a4>] end_that_request_first+0xb/0xd
> [87403.732699] [<c02bd73c>] __ide_end_request+0x54/0xe1
> [87403.738214] [<c02bd807>] ide_end_request+0x3e/0x5c
> [87403.743382] [<c02c35df>] task_error+0x5b/0x97
> [87403.748113] [<c02c36fa>] task_in_intr+0x6e/0xa2
> [87403.753120] [<c02bf19e>] ide_intr+0xaf/0x12c
> [87403.757815] [<c013e5a7>] handle_IRQ_event+0x23/0x57
> [87403.763135] [<c013e66f>] __do_IRQ+0x94/0xfd
> [87403.767802] [<c0105192>] do_IRQ+0x32/0x68

That doesn't look like raid was involved.  If it were, you would expect
to see raid1_end_write_request or raid1_end_read_request in that
trace.

Do you have any other partitions of hdc in use but not on raid?
Which partition is sector 176315718 in?

NeilBrown
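
[ Neil's point about the missing raid1_end_read_request can be read off
the completion path directly: bio_endio() just invokes whatever bi_end_io
hook the bio was submitted with, so a bio that raid1 had cloned and
re-issued to the member disk would show raid1's handler in the trace
rather than the filesystem's mpage_end_io_read.  Below is a paraphrased
sketch of the 2.6.18-era fs/bio.c helper, not the verbatim source. ]

void bio_endio(struct bio *bio, unsigned int bytes_done, int error)
{
	if (error)
		clear_bit(BIO_UPTODATE, &bio->bi_flags);

	/* completing more than is left indicates broken accounting upstream */
	if (unlikely(bytes_done > bio->bi_size)) {
		printk("%s: want %u bytes done, only %u left\n",
		       __FUNCTION__, bytes_done, bio->bi_size);
		bytes_done = bio->bi_size;
	}

	bio->bi_size -= bytes_done;
	bio->bi_sector += (bytes_done >> 9);

	/* whoever owns the bio gets told; for a raid1 clone this would be
	 * raid1_end_read_request / raid1_end_write_request */
	if (bio->bi_end_io)
		bio->bi_end_io(bio, bytes_done, error);
}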

* Re[2]: RAID1 submirror failure causes reboot?
From: Jim Klimov @ 2006-11-10 12:53 UTC
To: Neil Brown; +Cc: Linux RAID

Hello Neil,

>> [87398.531579] blk: request botched
NB>                ^^^^^^^^^^^^^^^^^^^^

NB> That looks bad.  Possibly some bug in the IDE controller or elsewhere
NB> in the block layer.  Jens: What might cause that?

NB> --snip--

NB> That doesn't look like raid was involved.  If it were, you would expect
NB> to see raid1_end_write_request or raid1_end_read_request in that
NB> trace.
So that might be the hardware or the software side of the IDE layer failing the
system, or a PCI problem, for example?

Just in case, the motherboard is a Supermicro X5DPE-G2: E7501 chipset,
dual Xeon-533. Rather old now, but way cool back when it was bought ;)
Still, like anything, it could have been manufactured badly. Maybe by
now there are some well-known problems with this hardware?

NB> Do you have any other partitions of hdc in use but not on raid?
NB> Which partition is sector 176315718 in?

All partitions are mirrored, and this sector is in hdc11:

# fdisk /dev/hdc
...
Command (m for help): u
Changing display/entry units to sectors

Command (m for help): p

Disk /dev/hdc: 102.9 GB, 102935347200 bytes
16 heads, 63 sectors/track, 199450 cylinders, total 201045600 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hdc1   *           63     1000943      500440+  fd  Linux raid autodetect
/dev/hdc2          1000944     5001695     2000376   fd  Linux raid autodetect
/dev/hdc3          5001696    16721711     5860008   83  Linux
/dev/hdc4         16721712   201045599    92161944    5  Extended
/dev/hdc5         16721775    24534719     3906472+  83  Linux
/dev/hdc6         24534783    28441727     1953472+  83  Linux
/dev/hdc7         28441791    32348735     1953472+  83  Linux
/dev/hdc8         32348799    44068751     5859976+  83  Linux
/dev/hdc9         44068815    51881759     3906472+  83  Linux
/dev/hdc10        51881823    59694767     3906472+  83  Linux
/dev/hdc11        59694831   190555343    65430256+  83  Linux
/dev/hdc12       190555407   201045599     5245096+  83  Linux

--
Best regards,
 Jim Klimov                          mailto:klimov@2ka.mipt.ru

* Re: Re[2]: RAID1 submirror failure causes reboot?
From: Neil Brown @ 2006-11-13 7:17 UTC
To: Jim Klimov; +Cc: Jens Axboe, Linux RAID

On Friday November 10, klimov@2ka.mipt.ru wrote:
> Hello Neil,
>
> >> [87398.531579] blk: request botched
> NB>                ^^^^^^^^^^^^^^^^^^^^
>
> NB> That looks bad.  Possibly some bug in the IDE controller or elsewhere
> NB> in the block layer.  Jens: What might cause that?
>
> NB> --snip--
>
> NB> That doesn't look like raid was involved.  If it were, you would expect
> NB> to see raid1_end_write_request or raid1_end_read_request in that
> NB> trace.
> So that might be the hardware or the software side of the IDE layer failing the
> system, or a PCI problem, for example?

What I think is happening here (and Jens: if you could tell me how
impossible this is, that would be good) is this:

Some error handling somewhere in the low-level ide driver is getting
confused and somehow one of the sector counts in the 'struct request' is
getting set wrongly.  blk_recalc_rq_sectors notices this and says
"blk: request botched".  It tries to auto-correct by increasing
rq->nr_sectors to be consistent with the other counts.

I'm *guessing* this is the wrong thing to do, and that it has the
side-effect that bi_end_io gets called on the bio twice.
The second time, the bio has been freed and reused, so the wrong
bi_end_io is called and it does the wrong thing.

This sounds a bit far-fetched, but it is the only explanation I can
come up with for the observed back trace, which is:

[87403.706012] [<c0103871>] error_code+0x39/0x40
[87403.710794] [<c0180e0a>] mpage_end_io_read+0x5e/0x72
[87403.716154] [<c0164af9>] bio_endio+0x56/0x7b
[87403.720798] [<c0256778>] __end_that_request_first+0x1e0/0x301
[87403.726985] [<c02568a4>] end_that_request_first+0xb/0xd
[87403.732699] [<c02bd73c>] __ide_end_request+0x54/0xe1
[87403.738214] [<c02bd807>] ide_end_request+0x3e/0x5c
[87403.743382] [<c02c35df>] task_error+0x5b/0x97
[87403.748113] [<c02c36fa>] task_in_intr+0x6e/0xa2
[87403.753120] [<c02bf19e>] ide_intr+0xaf/0x12c
[87403.757815] [<c013e5a7>] handle_IRQ_event+0x23/0x57
[87403.763135] [<c013e66f>] __do_IRQ+0x94/0xfd
[87403.767802] [<c0105192>] do_IRQ+0x32/0x68
[87403.772278] [<c010372e>] common_interrupt+0x1a/0x20

i.e. bio_endio goes straight to mpage_end_io_read despite the fact that the
filesystem is mounted over md/raid1.

Is the kernel compiled with CONFIG_DEBUG_SLAB=y and
CONFIG_DEBUG_PAGEALLOC=y?
They might help trigger the error earlier and so make the problem more
obvious.

NeilBrown
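
[ For context on the "auto-correct" Neil mentions: below is a paraphrased
sketch of the 2.6.18-era block-layer check that prints "blk: request
botched" (field and helper names approximate, not the verbatim ll_rw_blk.c
source).  It runs while a partially completed request is being updated;
if the request's remaining sector count has fallen below the size of its
first remaining segment, the accounting is inconsistent, so it warns and
forces the count back up, which is the step Neil suspects can leave a bio
looking incomplete and therefore being completed a second time. ]

static void blk_recalc_rq_sectors(struct request *rq, int nsect)
{
	if (blk_fs_request(rq)) {
		/* account for the 'nsect' sectors just completed */
		rq->hard_sector += nsect;
		rq->hard_nr_sectors -= nsect;

		/* ... resync rq->sector / rq->nr_sectors / rq->current_nr_sectors
		 *     with the hard_* copies and the current bio ... */

		if (rq->nr_sectors < rq->current_nr_sectors) {
			printk("blk: request botched\n");
			rq->nr_sectors = rq->current_nr_sectors;
		}
	}
}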

* Re: Re[2]: RAID1 submirror failure causes reboot?
From: Jens Axboe @ 2006-11-13 20:11 UTC
To: Neil Brown; +Cc: Jim Klimov, Linux RAID

On Mon, Nov 13 2006, Neil Brown wrote:
> On Friday November 10, klimov@2ka.mipt.ru wrote:
> > Hello Neil,
> >
> > >> [87398.531579] blk: request botched
> > NB>                ^^^^^^^^^^^^^^^^^^^^
> >
> > NB> That looks bad.  Possibly some bug in the IDE controller or elsewhere
> > NB> in the block layer.  Jens: What might cause that?
> >
> > NB> --snip--
> >
> > NB> That doesn't look like raid was involved.  If it were, you would expect
> > NB> to see raid1_end_write_request or raid1_end_read_request in that
> > NB> trace.
> > So that might be the hardware or the software side of the IDE layer failing the
> > system, or a PCI problem, for example?
>
> What I think is happening here (and Jens: if you could tell me how
> impossible this is, that would be good) is this:
>
> Some error handling somewhere in the low-level ide driver is getting
> confused and somehow one of the sector counts in the 'struct request' is
> getting set wrongly.  blk_recalc_rq_sectors notices this and says
> "blk: request botched".  It tries to auto-correct by increasing
> rq->nr_sectors to be consistent with the other counts.
>
> I'm *guessing* this is the wrong thing to do, and that it has the
> side-effect that bi_end_io gets called on the bio twice.
> The second time, the bio has been freed and reused, so the wrong
> bi_end_io is called and it does the wrong thing.

It doesn't sound at all unreasonable.  It's most likely either a bug in
the ide driver, or a "bad" bio being passed to the block layer (and
later on to the request and driver).  By "bad" I mean one that isn't
entirely consistent, which could be a bug in e.g. md.

The "request botched" error is usually a sign of something being
severely screwed up.  As you mention further down, get slab and page
debugging enabled to potentially catch this earlier.  It could be a
sign of a freed bio or request with corrupt contents.

> This sounds a bit far-fetched, but it is the only explanation I can
> come up with for the observed back trace, which is:
>
> [87403.706012] [<c0103871>] error_code+0x39/0x40
> [87403.710794] [<c0180e0a>] mpage_end_io_read+0x5e/0x72
> [87403.716154] [<c0164af9>] bio_endio+0x56/0x7b
> [87403.720798] [<c0256778>] __end_that_request_first+0x1e0/0x301
> [87403.726985] [<c02568a4>] end_that_request_first+0xb/0xd
> [87403.732699] [<c02bd73c>] __ide_end_request+0x54/0xe1
> [87403.738214] [<c02bd807>] ide_end_request+0x3e/0x5c
> [87403.743382] [<c02c35df>] task_error+0x5b/0x97
> [87403.748113] [<c02c36fa>] task_in_intr+0x6e/0xa2
> [87403.753120] [<c02bf19e>] ide_intr+0xaf/0x12c
> [87403.757815] [<c013e5a7>] handle_IRQ_event+0x23/0x57
> [87403.763135] [<c013e66f>] __do_IRQ+0x94/0xfd
> [87403.767802] [<c0105192>] do_IRQ+0x32/0x68
> [87403.772278] [<c010372e>] common_interrupt+0x1a/0x20
>
> i.e. bio_endio goes straight to mpage_end_io_read despite the fact that the
> filesystem is mounted over md/raid1.

What is the workload?  Is io to the real device mixed with io that came
through md as well?

> Is the kernel compiled with CONFIG_DEBUG_SLAB=y and
> CONFIG_DEBUG_PAGEALLOC=y?
> They might help trigger the error earlier and so make the problem more
> obvious.

Agree, that would be a good plan to enable.  Other questions: are you
seeing timeouts at any point?  The ide timeout code has some request/bio
"resetting" code which might be worrisome.

> NeilBrown

--
Jens Axboe

* Re: Re[2]: RAID1 submirror failure causes reboot?
From: Neil Brown @ 2006-11-13 22:05 UTC
To: Jens Axboe; +Cc: Jim Klimov, Linux RAID

On Monday November 13, jens.axboe@oracle.com wrote:
>
> It doesn't sound at all unreasonable.  It's most likely either a bug in
> the ide driver, or a "bad" bio being passed to the block layer (and
> later on to the request and driver).  By "bad" I mean one that isn't
> entirely consistent, which could be a bug in e.g. md.

I just noticed (while tracking raid6 problems...) that bio_clone calls
bio_phys_segments and bio_hw_segments (why does it do both?).
This calls blk_recount_segments, which does calculations based on
->bi_bdev.
Only immediately after calling bio_clone does raid1 change bi_bdev, thus
creating a potential inconsistency in the bio.  Would this sort of
inconsistency cause this problem?

> Agree, that would be a good plan to enable.  Other questions: are you
> seeing timeouts at any point?  The ide timeout code has some request/bio
> "resetting" code which might be worrisome.

Jim could probably answer this with more authority, but there aren't
obvious timeouts in the logs he posted.  A representative sample is:

[87338.675891] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87338.685143] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315711
[87338.694791] ide: failed opcode was: unknown
[87343.557424] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87343.566388] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315711
[87343.576105] ide: failed opcode was: unknown
[87348.472226] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87348.481170] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315711
[87348.490843] ide: failed opcode was: unknown
[87353.387028] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87353.395735] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315711
[87353.405500] ide: failed opcode was: unknown
[87353.461342] ide1: reset: success

Thanks,
NeilBrown
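
[ The ordering Neil describes is visible in the raid1 read path; below is
a simplified sketch of that pattern, not the verbatim drivers/md/raid1.c
source, and the variable names are illustrative. ]

/* raid1 read submission, simplified: the bio is cloned first (bio_clone()
 * recounts phys/hw segments while bi_bdev still points at the md device),
 * and only afterwards is the clone retargeted at the chosen member disk. */
read_bio = bio_clone(master_bio, GFP_NOIO);	/* segment counts computed here */

r1_bio->bios[disk] = read_bio;

read_bio->bi_sector = r1_bio->sector + mirror->rdev->data_offset;
read_bio->bi_bdev = mirror->rdev->bdev;		/* bi_bdev changed after the recount */
read_bio->bi_end_io = raid1_end_read_request;	/* raid1's own completion hook */
read_bio->bi_rw = READ;
read_bio->bi_private = r1_bio;

generic_make_request(read_bio);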

* Re: Re[2]: RAID1 submirror failure causes reboot?
From: Jens Axboe @ 2006-11-14 7:28 UTC
To: Neil Brown; +Cc: Jim Klimov, Linux RAID

On Tue, Nov 14 2006, Neil Brown wrote:
> On Monday November 13, jens.axboe@oracle.com wrote:
> >
> > It doesn't sound at all unreasonable.  It's most likely either a bug in
> > the ide driver, or a "bad" bio being passed to the block layer (and
> > later on to the request and driver).  By "bad" I mean one that isn't
> > entirely consistent, which could be a bug in e.g. md.
>
> I just noticed (while tracking raid6 problems...) that bio_clone calls
> bio_phys_segments and bio_hw_segments (why does it do both?).
> This calls blk_recount_segments, which does calculations based on
> ->bi_bdev.
> Only immediately after calling bio_clone does raid1 change bi_bdev, thus
> creating a potential inconsistency in the bio.  Would this sort of
> inconsistency cause this problem?

raid1 should change it first, you are right.  But it should not matter,
as the real device should have restrictions that are at least equal to
those of the md device.  So it may be a bit more conservative, but I
don't think there's an actual bug there.

> > Agree, that would be a good plan to enable.  Other questions: are you
> > seeing timeouts at any point?  The ide timeout code has some request/bio
> > "resetting" code which might be worrisome.
>
> Jim could probably answer this with more authority, but there aren't
> obvious timeouts in the logs he posted.  A representative sample is:
> [87338.675891] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
> [87338.685143] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315711
> [87338.694791] ide: failed opcode was: unknown
> [87343.557424] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
> [87343.566388] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315711
> [87343.576105] ide: failed opcode was: unknown
> [87348.472226] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
> [87348.481170] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315711
> [87348.490843] ide: failed opcode was: unknown
> [87353.387028] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
> [87353.395735] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315711
> [87353.405500] ide: failed opcode was: unknown
> [87353.461342] ide1: reset: success

Then let's wait for Jim to repeat his testing with all the debugging
options enabled; that should make us a little wiser.

--
Jens Axboe

* Re[4]: RAID1 submirror failure causes reboot?
From: Jim Klimov @ 2006-11-14 10:36 UTC
To: Jens Axboe; +Cc: Jim Klimov, Linux RAID

Hello Jens,

JA> Then let's wait for Jim to repeat his testing with all the debugging
JA> options enabled; that should make us a little wiser.

Ok, I'll enable the kernel rebuilt with these options and report any
findings later. For now, I'll move on to the other questions raised.

I remember that when I ran with most of the debugging ON last week, it
made the server run terribly slow, with a load average around 40-60 and
little responsiveness, and numerous messages like:

  Hangcheck: hangcheck value past margin!
  BUG: soft lockup detected on CPU#1!

which hide the interesting traces :) However, I can post the captured
traces from several system lifetimes to the list or privately.

Concerning the other questions:

1) The workload on the software raid is rather small. It is a set of
system partitions which keep the fileserver's logs, etc. The file
storage is on 3Ware cards and carries the substantial load. The MD
arrays are checked nightly, though (echo check > sync_action), and
most often this is what triggers the problem. These drives only
contain mirrored partitions, so there should be no I/O to them that
bypasses MD, except for the rare cases of lilo running :)

2) I installed a third submirror disk this weekend; it is an IDE slave
device, hdd (next to the failing hdc). Since then I have gotten errors
on other partitions, attached below as "2*)".

3) The failures which lead to reboots are usually preceded by a long
history of dma_intr errors 0x40 and 0x51, but the sample I sent was
already fairly complete. Only a few errors preceded it, at 5-second
intervals, making the full trace like this:

--snip-- (the same trace as in my first message: the dma_intr and
task_in_intr errors starting at [87319.049902], "end_request: I/O
error", "blk: request botched", and the kernel BUG at mm/filemap.c:541
followed by the "Fatal exception in interrupt" panic)

2*) The new problem log:

... rebuilding new spare partitions into a 3-way array went well ...

[114683.721498] --- wd:3 rd:3
[114683.721508] disk 0, wo:0, o:1, dev:hdc10
[114683.721541] disk 1, wo:0, o:1, dev:hda10
[114683.721556] disk 2, wo:0, o:1, dev:hdd10
[115150.208060] RAID1 conf printout:
[115150.211893] --- wd:3 rd:3
[115150.214923] disk 0, wo:0, o:1, dev:hdc8
[115150.219269] disk 1, wo:0, o:1, dev:hda8
[115150.224022] disk 2, wo:0, o:1, dev:hdd8
[115296.149849] RAID1 conf printout:
[115296.149868] --- wd:3 rd:3
[115296.149895] disk 0, wo:0, o:1, dev:hdc7
[115296.149915] disk 1, wo:0, o:1, dev:hda7
[115296.149931] disk 2, wo:0, o:1, dev:hdd7
[115460.641824] hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[115460.649294] hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=28433205, sector=28433023
[115460.658774] ide: failed opcode was: unknown
[115460.663287] end_request: I/O error, dev hdc, sector 28433023
[115465.938995] hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[115465.946451] hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=28433201, sector=28433031
[115465.957536] ide: failed opcode was: unknown
[115465.962041] end_request: I/O error, dev hdc, sector 28433031
[115488.743391] hdc: dma_timer_expiry: dma status == 0x60
[115488.748714] hdc: DMA timeout retry
[115488.752450] hdc: timeout waiting for DMA
[115488.979377] hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
[115488.987829] ide: failed opcode was: unknown
[115488.992469] hdc: drive not ready for command
[115518.991827] hdc: lost interrupt
[115518.995201] hdc: task_out_intr: status=0x58 { DriveReady SeekComplete DataRequest }
[115519.003493] ide: failed opcode was: unknown
[115524.009004] hdc: status timeout: status=0xd0 { Busy }
[115524.014441] ide: failed opcode was: unknown
[115524.018988] hdc: drive not ready for command
[115524.114548] ide1: reset: success
[115524.198619] raid1: Disk failure on hdc6, disabling device.
[115524.198624] Operation continuing on 2 devices
[115524.209260] hdc: task_out_intr: status=0x50 { DriveReady SeekComplete }
[115544.209530] hdc: dma_timer_expiry: dma status == 0x20
[115544.215027] hdc: DMA timeout retry
[115544.218669] hdc: timeout waiting for DMA
[115544.461490] hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
[115544.469875] ide: failed opcode was: unknown
[115544.474388] hdc: drive not ready for command
[115544.503556] RAID1 conf printout:
[115544.507091] --- wd:1 rd:2
[115544.510120] disk 0, wo:0, o:1, dev:hdd1
[115544.712573] ------------[ cut here ]------------
[115544.717459] kernel BUG at mm/filemap.c:541!
[115544.721902] invalid opcode: 0000 [#1]
[115544.725806] SMP
[115544.728015] Modules linked in: w83781d hwmon_vid i2c_isa i2c_core w83627hf_wdt
[115544.735981] CPU: 0
[115544.735983] EIP: 0060:[<c013c436>] Not tainted VLI
[115544.735986] EFLAGS: 00010046 (2.6.18.2server #4)
[115544.749248] EIP is at unlock_page+0xf/0x28
[115544.753589] eax: 00000000 ebx: c1f5d8c0 ecx: eab25680 edx: c1f5d8c0
[115544.760627] esi: f6a78f00 edi: 00000001 ebp: 00000000 esp: c0481e7c
[115544.767703] ds: 007b es: 007b ss: 0068
[115544.772033] Process swapper (pid: 0, ti=c0480000 task=c040e460 task.ti=c0480000)
[115544.779512] Stack: eab25674 c017bed6 f6a78f00 00000000 c017be7c c016069e 00000003 c2e2aa20
[115544.788824] c0115cf1 00000020 f6a78f00 00003400 f6a78f00 00000000 c024f8ac f7f37a00
[115544.798171] f7f37a00 00000001 c04fcd94 00000001 00000000 00000000 00000000 00002c00
[115544.807683] Call Trace:
[115544.810797] [<c017bed6>] mpage_end_io_read+0x5a/0x6e
[115544.816407] [<c017be7c>] mpage_end_io_read+0x0/0x6e
[115544.821865] [<c016069e>] bio_endio+0x5f/0x84
[115544.826749] [<c0115cf1>] find_busiest_group+0x153/0x4df
[115544.832582] [<c024f8ac>] __end_that_request_first+0x1ec/0x31c
[115544.838889] [<c02b508a>] __ide_end_request+0x56/0xe4
[115544.844462] [<c02b5159>] ide_end_request+0x41/0x65
[115544.849671] [<c02bad69>] task_end_request+0x37/0x7e
[115544.854950] [<c02baed9>] task_out_intr+0x84/0xb9
[115544.860128] [<c01243bd>] del_timer+0x56/0x58
[115544.864839] [<c02bae55>] task_out_intr+0x0/0xb9
[115544.869760] [<c02b6a4b>] ide_intr+0xb7/0x132
[115544.874467] [<c013a2f4>] handle_IRQ_event+0x26/0x59
[115544.879740] [<c013a3ae>] __do_IRQ+0x87/0xed
[115544.884368] [<c0105199>] do_IRQ+0x31/0x69
[115544.888843] [<c0103746>] common_interrupt+0x1a/0x20
[115544.894124] [<c0100c4f>] default_idle+0x35/0x5b
[115544.899233] [<c0100d03>] cpu_idle+0x7a/0x83
[115544.903844] [<c0486810>] start_kernel+0x16a/0x194
[115544.908950] [<c0486286>] unknown_bootoption+0x0/0x1b6
[115544.914384] Code: e8 5e ff ff ff b9 __wake_up+0x38/0x4e
[115545.083930] [<c01243bd>]

--
Best regards,
 Jim Klimov                          mailto:klimov@2ka.mipt.ru