* What the heck happened to my array? (No apparent data loss). @ 2011-04-03 13:32 Brad Campbell 2011-04-03 15:47 ` Roberto Spadim 0 siblings, 1 reply; 12+ messages in thread From: Brad Campbell @ 2011-04-03 13:32 UTC (permalink / raw) To: linux-raid 2.6.38.2 x86_64 10 x 1TB SATA drives in a single RAID-6 Here is the chain of events. Saturday morning I started a reshape on a 10 element RAID-6. Simply changing the Chunk size from 512k to 64k. This was going to take about 4.5 days according to the initial estimates. I then went away for the weekend and came home to a wedged array. Here is the chain of events that caused it. This occurred about 1 minute after my scheduled morning SMART long (it is Sunday after all) began. Apr 3 03:19:08 srv kernel: [288180.455339] sd 0:0:12:0: [sdd] Unhandled error code Apr 3 03:19:08 srv kernel: [288180.455359] sd 0:0:12:0: [sdd] Result: hostbyte=0x04 driverbyte=0x00 Apr 3 03:19:08 srv kernel: [288180.455377] sd 0:0:12:0: [sdd] CDB: cdb[0]=0x2a: 2a 00 00 00 00 08 00 00 02 00 Apr 3 03:19:08 srv kernel: [288180.455415] end_request: I/O error, dev sdd, sector 8 Apr 3 03:19:08 srv kernel: [288180.455449] end_request: I/O error, dev sdd, sector 8 Apr 3 03:19:08 srv kernel: [288180.455462] md: super_written gets error=-5, uptodate=0 Apr 3 03:19:08 srv kernel: [288180.455477] md/raid:md0: Disk failure on sdd, disabling device. Apr 3 03:19:08 srv kernel: [288180.455480] md/raid:md0: Operation continuing on 9 devices. Apr 3 03:19:08 srv kernel: [288180.472914] md: md0: reshape done. Apr 3 03:19:08 srv kernel: [288180.472983] md: delaying data-check of md5 until md3 has finished (they share one or more physical units) Apr 3 03:19:08 srv kernel: [288180.473002] md: delaying data-check of md4 until md6 has finished (they share one or more physical units) Apr 3 03:19:08 srv kernel: [288180.473030] md: delaying data-check of md6 until md5 has finished (they share one or more physical units) Apr 3 03:19:08 srv kernel: [288180.473047] md: delaying data-check of md3 until md1 has finished (they share one or more physical units) Apr 3 03:19:08 srv kernel: [288180.551450] md: reshape of RAID array md0 Apr 3 03:19:08 srv kernel: [288180.551468] md: minimum _guaranteed_ speed: 200000 KB/sec/disk. Apr 3 03:19:08 srv kernel: [288180.551483] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape. Apr 3 03:19:08 srv kernel: [288180.551514] md: using 128k window, over a total of 976759808 blocks. 
Apr 3 03:19:08 srv kernel: [288180.620089] sd 0:0:12:0: [sdd] Synchronizing SCSI cache Apr 3 03:19:08 srv mdadm[4803]: RebuildFinished event detected on md device /dev/md0 Apr 3 03:19:08 srv mdadm[4803]: Fail event detected on md device /dev/md0, component device /dev/sdd Apr 3 03:19:08 srv mdadm[4803]: RebuildStarted event detected on md device /dev/md0 Apr 3 03:19:10 srv kernel: [288182.614918] scsi 0:0:12:0: Direct-Access ATA MAXTOR STM310003 MX1A PQ: 0 ANSI: 5 Apr 3 03:19:10 srv kernel: [288182.615312] sd 0:0:12:0: Attached scsi generic sg3 type 0 Apr 3 03:19:10 srv kernel: [288182.618262] sd 0:0:12:0: [sdq] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB) Apr 3 03:19:10 srv kernel: [288182.736998] sd 0:0:12:0: [sdq] Write Protect is off Apr 3 03:19:10 srv kernel: [288182.737019] sd 0:0:12:0: [sdq] Mode Sense: 73 00 00 08 Apr 3 03:19:10 srv kernel: [288182.740521] sd 0:0:12:0: [sdq] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Apr 3 03:19:10 srv kernel: [288182.848999] sdq: unknown partition table Apr 3 03:19:10 srv ata_id[28453]: HDIO_GET_IDENTITY failed for '/dev/sdq' Apr 3 03:19:10 srv kernel: [288182.970091] sd 0:0:12:0: [sdq] Attached SCSI disk Apr 3 03:20:01 srv /USR/SBIN/CRON[28624]: (brad) CMD ([ -z "`/usr/bin/pgrep -u brad collect`" ] && /usr/bin/screen -X -S brad-bot screen /home/brad/bin/collect-thermostat) Apr 3 03:20:01 srv /USR/SBIN/CRON[28625]: (root) CMD ([ -z `/usr/bin/pgrep -u root keepalive` ] && /home/brad/bin/launch-keepalive) Apr 3 03:20:01 srv /USR/SBIN/CRON[28626]: (brad) CMD ([ -z "`screen -list | grep brad-bot`" ] && /home/brad/bin/botstart) Apr 3 03:20:01 srv /USR/SBIN/CRON[28628]: (root) CMD (if [ -x /usr/bin/mrtg ] && [ -r /etc/mrtg.cfg ]; then mkdir -p /var/log/mrtg ; env LANG=C /usr/bin/mrtg /etc/mrtg.cfg 2>&1 | tee -a /var/log/mrtg/mrtg.log ; fi) Apr 3 03:20:01 srv /USR/SBIN/CRON[28627]: (brad) CMD (/home/brad/rrd/rrd-create-graphs) Apr 3 03:20:01 srv /USR/SBIN/CRON[28590]: (CRON) error (grandchild #28625 failed with exit status 1) Apr 3 03:20:01 srv /USR/SBIN/CRON[28589]: (CRON) error (grandchild #28626 failed with exit status 1) Apr 3 03:20:01 srv /USR/SBIN/CRON[28587]: (CRON) error (grandchild #28624 failed with exit status 1) Apr 3 03:22:10 srv kernel: [288363.070094] INFO: task jbd2/md0-8:2647 blocked for more than 120 seconds. Apr 3 03:22:10 srv kernel: [288363.070114] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Apr 3 03:22:10 srv kernel: [288363.070132] jbd2/md0-8 D ffff88041aa52948 0 2647 2 0x00000000 Apr 3 03:22:10 srv kernel: [288363.070154] ffff88041aa526f0 0000000000000046 0000000000000000 ffff8804196769b0 Apr 3 03:22:10 srv kernel: [288363.070178] 0000000000011180 ffff88041bdc5fd8 0000000000004000 0000000000011180 Apr 3 03:22:10 srv kernel: [288363.070201] ffff88041bdc4010 ffff88041aa52950 ffff88041bdc5fd8 ffff88041aa52948 Apr 3 03:22:10 srv kernel: [288363.070224] Call Trace: Apr 3 03:22:10 srv kernel: [288363.070246] [<ffffffff8104e4c6>] ? queue_work_on+0x16/0x20 Apr 3 03:22:10 srv kernel: [288363.070266] [<ffffffff812e6bfd>] ? md_write_start+0xad/0x190 Apr 3 03:22:10 srv kernel: [288363.070283] [<ffffffff81052b90>] ? autoremove_wake_function+0x0/0x30 Apr 3 03:22:10 srv kernel: [288363.070299] [<ffffffff812e16f5>] ? make_request+0x35/0x600 Apr 3 03:22:10 srv kernel: [288363.070317] [<ffffffff8108463b>] ? __alloc_pages_nodemask+0x10b/0x810 Apr 3 03:22:10 srv kernel: [288363.070335] [<ffffffff81142042>] ? 
T.1015+0x32/0x90 Apr 3 03:22:10 srv kernel: [288363.070350] [<ffffffff812e6a24>] ? md_make_request+0xd4/0x200 Apr 3 03:22:10 srv kernel: [288363.070366] [<ffffffff81142218>] ? ext4_map_blocks+0x178/0x210 Apr 3 03:22:10 srv kernel: [288363.070382] [<ffffffff811b6e84>] ? generic_make_request+0x144/0x2f0 Apr 3 03:22:10 srv kernel: [288363.070397] [<ffffffff8116e89d>] ? jbd2_journal_file_buffer+0x3d/0x70 Apr 3 03:22:10 srv kernel: [288363.070413] [<ffffffff811b708c>] ? submit_bio+0x5c/0xd0 Apr 3 03:22:10 srv kernel: [288363.070430] [<ffffffff810e61d5>] ? submit_bh+0xe5/0x120 Apr 3 03:22:10 srv kernel: [288363.070445] [<ffffffff811709b1>] ? jbd2_journal_commit_transaction+0x441/0x1180 Apr 3 03:22:10 srv kernel: [288363.070466] [<ffffffff81044893>] ? lock_timer_base+0x33/0x70 Apr 3 03:22:10 srv kernel: [288363.070480] [<ffffffff81052b90>] ? autoremove_wake_function+0x0/0x30 Apr 3 03:22:10 srv kernel: [288363.070498] [<ffffffff81174871>] ? kjournald2+0xb1/0x1e0 Apr 3 03:22:10 srv kernel: [288363.070511] [<ffffffff81052b90>] ? autoremove_wake_function+0x0/0x30 Apr 3 03:22:10 srv kernel: [288363.070527] [<ffffffff811747c0>] ? kjournald2+0x0/0x1e0 Apr 3 03:22:10 srv kernel: [288363.070544] [<ffffffff811747c0>] ? kjournald2+0x0/0x1e0 Apr 3 03:22:10 srv kernel: [288363.070557] [<ffffffff81052716>] ? kthread+0x96/0xa0 Apr 3 03:22:10 srv kernel: [288363.070573] [<ffffffff81003154>] ? kernel_thread_helper+0x4/0x10 Apr 3 03:22:10 srv kernel: [288363.070588] [<ffffffff81052680>] ? kthread+0x0/0xa0 Apr 3 03:22:10 srv kernel: [288363.070602] [<ffffffff81003150>] ? kernel_thread_helper+0x0/0x10 So apparently sdd suffered an unknown failure (it happens) and the array kicked it out (as it should). But 120 seconds later all tasks accessing that array trigger their 120 second hangcheck warning and are all suck in the D state. At the time the array was 12.1% of the way through a reshape. I had to reboot the machine to get it back up and it's now continuing the reshape on 9 drives. brad@srv:~$ cat /proc/mdstat Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] md0 : active raid6 sdc[0] sdh[9] sda[8] sde[7] sdg[5] sdb[4] sdf[3] sdm[2] sdl[1] 7814078464 blocks super 1.2 level 6, 512k chunk, algorithm 2 [10/9] [UUUUUU_UUU] [===>.................] reshape = 16.5% (162091008/976759808) finish=5778.6min speed=2349K/sec To make matters more confusing the other arrays on the machine were in the middle of their "Debians first Sunday of every month" "check" scrub. I have the full syslog and can probably procure any other information that might be useful. I don't think I've lost any data, the machine continued reshaping and we're all moving along nicely. I just wanted to report it and offer assistance in diagnosing it should that be requested. Regards, Brad ^ permalink raw reply [flat|nested] 12+ messages in thread
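For readers reconstructing the setup: a chunk-size-only reshape like the one Brad describes is normally started with mdadm's --grow mode. The exact command he ran is not shown in the thread, so the following is only a plausible sketch, with the device name and backup path assumed:

  # Hypothetical reconstruction, not the command from the thread.
  # --chunk with --grow changes only the chunk size; --backup-file gives the
  # background mdadm process somewhere to stash each critical section while
  # the reshape walks across the array.
  mdadm --grow /dev/md0 --chunk=64 --backup-file=/root/md0-reshape.bak

  # Progress then shows up in /proc/mdstat, as in the output above.
  cat /proc/mdstat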
* Re: What the heck happened to my array? (No apparent data loss). 2011-04-03 13:32 What the heck happened to my array? (No apparent data loss) Brad Campbell @ 2011-04-03 15:47 ` Roberto Spadim 2011-04-04 5:59 ` Brad Campbell 0 siblings, 1 reply; 12+ messages in thread From: Roberto Spadim @ 2011-04-03 15:47 UTC (permalink / raw) To: Brad Campbell; +Cc: linux-raid what kernel version? more informations about your linux box? 2011/4/3 Brad Campbell <lists2009@fnarfbargle.com>: > 2.6.38.2 > x86_64 > 10 x 1TB SATA drives in a single RAID-6 > > Here is the chain of events. > > Saturday morning I started a reshape on a 10 element RAID-6. Simply changing > the Chunk size from 512k to 64k. This was going to take about 4.5 days > according to the initial estimates. > > I then went away for the weekend and came home to a wedged array. > Here is the chain of events that caused it. > > This occurred about 1 minute after my scheduled morning SMART long (it is > Sunday after all) began. > > Apr 3 03:19:08 srv kernel: [288180.455339] sd 0:0:12:0: [sdd] Unhandled > error code > Apr 3 03:19:08 srv kernel: [288180.455359] sd 0:0:12:0: [sdd] Result: > hostbyte=0x04 driverbyte=0x00 > Apr 3 03:19:08 srv kernel: [288180.455377] sd 0:0:12:0: [sdd] CDB: > cdb[0]=0x2a: 2a 00 00 00 00 08 00 00 02 00 > Apr 3 03:19:08 srv kernel: [288180.455415] end_request: I/O error, dev sdd, > sector 8 > Apr 3 03:19:08 srv kernel: [288180.455449] end_request: I/O error, dev sdd, > sector 8 > Apr 3 03:19:08 srv kernel: [288180.455462] md: super_written gets error=-5, > uptodate=0 > Apr 3 03:19:08 srv kernel: [288180.455477] md/raid:md0: Disk failure on > sdd, disabling device. > Apr 3 03:19:08 srv kernel: [288180.455480] md/raid:md0: Operation > continuing on 9 devices. > Apr 3 03:19:08 srv kernel: [288180.472914] md: md0: reshape done. > Apr 3 03:19:08 srv kernel: [288180.472983] md: delaying data-check of md5 > until md3 has finished (they share one or more physical units) > Apr 3 03:19:08 srv kernel: [288180.473002] md: delaying data-check of md4 > until md6 has finished (they share one or more physical units) > Apr 3 03:19:08 srv kernel: [288180.473030] md: delaying data-check of md6 > until md5 has finished (they share one or more physical units) > Apr 3 03:19:08 srv kernel: [288180.473047] md: delaying data-check of md3 > until md1 has finished (they share one or more physical units) > Apr 3 03:19:08 srv kernel: [288180.551450] md: reshape of RAID array md0 > Apr 3 03:19:08 srv kernel: [288180.551468] md: minimum _guaranteed_ speed: > 200000 KB/sec/disk. > Apr 3 03:19:08 srv kernel: [288180.551483] md: using maximum available idle > IO bandwidth (but not more than 200000 KB/sec) for reshape. > Apr 3 03:19:08 srv kernel: [288180.551514] md: using 128k window, over a > total of 976759808 blocks. 
> Apr 3 03:19:08 srv kernel: [288180.620089] sd 0:0:12:0: [sdd] Synchronizing > SCSI cache > Apr 3 03:19:08 srv mdadm[4803]: RebuildFinished event detected on md device > /dev/md0 > Apr 3 03:19:08 srv mdadm[4803]: Fail event detected on md device /dev/md0, > component device /dev/sdd > Apr 3 03:19:08 srv mdadm[4803]: RebuildStarted event detected on md device > /dev/md0 > Apr 3 03:19:10 srv kernel: [288182.614918] scsi 0:0:12:0: Direct-Access > ATA MAXTOR STM310003 MX1A PQ: 0 ANSI: 5 > Apr 3 03:19:10 srv kernel: [288182.615312] sd 0:0:12:0: Attached scsi > generic sg3 type 0 > Apr 3 03:19:10 srv kernel: [288182.618262] sd 0:0:12:0: [sdq] 1953525168 > 512-byte logical blocks: (1.00 TB/931 GiB) > Apr 3 03:19:10 srv kernel: [288182.736998] sd 0:0:12:0: [sdq] Write Protect > is off > Apr 3 03:19:10 srv kernel: [288182.737019] sd 0:0:12:0: [sdq] Mode Sense: > 73 00 00 08 > Apr 3 03:19:10 srv kernel: [288182.740521] sd 0:0:12:0: [sdq] Write cache: > enabled, read cache: enabled, doesn't support DPO or FUA > Apr 3 03:19:10 srv kernel: [288182.848999] sdq: unknown partition table > Apr 3 03:19:10 srv ata_id[28453]: HDIO_GET_IDENTITY failed for '/dev/sdq' > Apr 3 03:19:10 srv kernel: [288182.970091] sd 0:0:12:0: [sdq] Attached SCSI > disk > Apr 3 03:20:01 srv /USR/SBIN/CRON[28624]: (brad) CMD ([ -z "`/usr/bin/pgrep > -u brad collect`" ] && /usr/bin/screen -X -S brad-bot screen > /home/brad/bin/collect-thermostat) > Apr 3 03:20:01 srv /USR/SBIN/CRON[28625]: (root) CMD ([ -z `/usr/bin/pgrep > -u root keepalive` ] && /home/brad/bin/launch-keepalive) > Apr 3 03:20:01 srv /USR/SBIN/CRON[28626]: (brad) CMD ([ -z "`screen -list | > grep brad-bot`" ] && /home/brad/bin/botstart) > Apr 3 03:20:01 srv /USR/SBIN/CRON[28628]: (root) CMD (if [ -x /usr/bin/mrtg > ] && [ -r /etc/mrtg.cfg ]; then mkdir -p /var/log/mrtg ; env LANG=C > /usr/bin/mrtg /etc/mrtg.cfg 2>&1 | tee -a /var/log/mrtg/mrtg.log ; fi) > Apr 3 03:20:01 srv /USR/SBIN/CRON[28627]: (brad) CMD > (/home/brad/rrd/rrd-create-graphs) > Apr 3 03:20:01 srv /USR/SBIN/CRON[28590]: (CRON) error (grandchild #28625 > failed with exit status 1) > Apr 3 03:20:01 srv /USR/SBIN/CRON[28589]: (CRON) error (grandchild #28626 > failed with exit status 1) > Apr 3 03:20:01 srv /USR/SBIN/CRON[28587]: (CRON) error (grandchild #28624 > failed with exit status 1) > Apr 3 03:22:10 srv kernel: [288363.070094] INFO: task jbd2/md0-8:2647 > blocked for more than 120 seconds. > Apr 3 03:22:10 srv kernel: [288363.070114] "echo 0 > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > Apr 3 03:22:10 srv kernel: [288363.070132] jbd2/md0-8 D > ffff88041aa52948 0 2647 2 0x00000000 > Apr 3 03:22:10 srv kernel: [288363.070154] ffff88041aa526f0 > 0000000000000046 0000000000000000 ffff8804196769b0 > Apr 3 03:22:10 srv kernel: [288363.070178] 0000000000011180 > ffff88041bdc5fd8 0000000000004000 0000000000011180 > Apr 3 03:22:10 srv kernel: [288363.070201] ffff88041bdc4010 > ffff88041aa52950 ffff88041bdc5fd8 ffff88041aa52948 > Apr 3 03:22:10 srv kernel: [288363.070224] Call Trace: > Apr 3 03:22:10 srv kernel: [288363.070246] [<ffffffff8104e4c6>] ? > queue_work_on+0x16/0x20 > Apr 3 03:22:10 srv kernel: [288363.070266] [<ffffffff812e6bfd>] ? > md_write_start+0xad/0x190 > Apr 3 03:22:10 srv kernel: [288363.070283] [<ffffffff81052b90>] ? > autoremove_wake_function+0x0/0x30 > Apr 3 03:22:10 srv kernel: [288363.070299] [<ffffffff812e16f5>] ? > make_request+0x35/0x600 > Apr 3 03:22:10 srv kernel: [288363.070317] [<ffffffff8108463b>] ? 
> __alloc_pages_nodemask+0x10b/0x810 > Apr 3 03:22:10 srv kernel: [288363.070335] [<ffffffff81142042>] ? > T.1015+0x32/0x90 > Apr 3 03:22:10 srv kernel: [288363.070350] [<ffffffff812e6a24>] ? > md_make_request+0xd4/0x200 > Apr 3 03:22:10 srv kernel: [288363.070366] [<ffffffff81142218>] ? > ext4_map_blocks+0x178/0x210 > Apr 3 03:22:10 srv kernel: [288363.070382] [<ffffffff811b6e84>] ? > generic_make_request+0x144/0x2f0 > Apr 3 03:22:10 srv kernel: [288363.070397] [<ffffffff8116e89d>] ? > jbd2_journal_file_buffer+0x3d/0x70 > Apr 3 03:22:10 srv kernel: [288363.070413] [<ffffffff811b708c>] ? > submit_bio+0x5c/0xd0 > Apr 3 03:22:10 srv kernel: [288363.070430] [<ffffffff810e61d5>] ? > submit_bh+0xe5/0x120 > Apr 3 03:22:10 srv kernel: [288363.070445] [<ffffffff811709b1>] ? > jbd2_journal_commit_transaction+0x441/0x1180 > Apr 3 03:22:10 srv kernel: [288363.070466] [<ffffffff81044893>] ? > lock_timer_base+0x33/0x70 > Apr 3 03:22:10 srv kernel: [288363.070480] [<ffffffff81052b90>] ? > autoremove_wake_function+0x0/0x30 > Apr 3 03:22:10 srv kernel: [288363.070498] [<ffffffff81174871>] ? > kjournald2+0xb1/0x1e0 > Apr 3 03:22:10 srv kernel: [288363.070511] [<ffffffff81052b90>] ? > autoremove_wake_function+0x0/0x30 > Apr 3 03:22:10 srv kernel: [288363.070527] [<ffffffff811747c0>] ? > kjournald2+0x0/0x1e0 > Apr 3 03:22:10 srv kernel: [288363.070544] [<ffffffff811747c0>] ? > kjournald2+0x0/0x1e0 > Apr 3 03:22:10 srv kernel: [288363.070557] [<ffffffff81052716>] ? > kthread+0x96/0xa0 > Apr 3 03:22:10 srv kernel: [288363.070573] [<ffffffff81003154>] ? > kernel_thread_helper+0x4/0x10 > Apr 3 03:22:10 srv kernel: [288363.070588] [<ffffffff81052680>] ? > kthread+0x0/0xa0 > Apr 3 03:22:10 srv kernel: [288363.070602] [<ffffffff81003150>] ? > kernel_thread_helper+0x0/0x10 > > So apparently sdd suffered an unknown failure (it happens) and the array > kicked it out (as it should). But 120 seconds later all tasks accessing that > array trigger their 120 second hangcheck warning and are all suck in the D > state. > > At the time the array was 12.1% of the way through a reshape. I had to > reboot the machine to get it back up and it's now continuing the reshape on > 9 drives. > > brad@srv:~$ cat /proc/mdstat > Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] > md0 : active raid6 sdc[0] sdh[9] sda[8] sde[7] sdg[5] sdb[4] sdf[3] sdm[2] > sdl[1] > 7814078464 blocks super 1.2 level 6, 512k chunk, algorithm 2 [10/9] > [UUUUUU_UUU] > [===>.................] reshape = 16.5% (162091008/976759808) > finish=5778.6min speed=2349K/sec > > > > To make matters more confusing the other arrays on the machine were in the > middle of their "Debians first Sunday of every month" "check" scrub. > > I have the full syslog and can probably procure any other information that > might be useful. I don't think I've lost any data, the machine continued > reshaping and we're all moving along nicely. I just wanted to report it and > offer assistance in diagnosing it should that be requested. > > Regards, > Brad > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Roberto Spadim Spadim Technology / SPAEmpresarial -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: What the heck happened to my array? (No apparent data loss). 2011-04-03 15:47 ` Roberto Spadim @ 2011-04-04 5:59 ` Brad Campbell 2011-04-04 16:49 ` Roberto Spadim 0 siblings, 1 reply; 12+ messages in thread From: Brad Campbell @ 2011-04-04 5:59 UTC (permalink / raw) To: Roberto Spadim; +Cc: linux-raid On 03/04/11 23:47, Roberto Spadim wrote: > what kernel version? more informations about your linux box? The kernel version and architecture were the first 2 lines of the E-mail you top posted over. What would you like to know about the box? It's a 6 core Phenom-II with 16G of ram. 2 LSI SAS 9240 controllers configured with 10 x 1TB SATA Drives in a RAID-6(md0) & 3 x 750GB SATA drives in a RAID-5(md2). The boot drives are a pair of 1TB SATA drives in multiple RAID-1's using the on-board AMD chipset controller and there is a 64GB SSD on a separate PCI-E Marvell 7042m Controller. The array in question is : root@srv:~# mdadm --detail /dev/md0 /dev/md0: Version : 1.2 Creation Time : Sat Jan 8 11:25:17 2011 Raid Level : raid6 Array Size : 7814078464 (7452.09 GiB 8001.62 GB) Used Dev Size : 976759808 (931.51 GiB 1000.20 GB) Raid Devices : 10 Total Devices : 9 Persistence : Superblock is persistent Update Time : Mon Apr 4 13:53:59 2011 State : clean, degraded, recovering Active Devices : 9 Working Devices : 9 Failed Devices : 0 Spare Devices : 0 Layout : left-symmetric Chunk Size : 512K Reshape Status : 29% complete New Chunksize : 64K Name : srv:server (local to host srv) UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e Events : 429198 Number Major Minor RaidDevice State 0 8 32 0 active sync /dev/sdc 1 8 176 1 active sync /dev/sdl 2 8 192 2 active sync /dev/sdm 3 8 80 3 active sync /dev/sdf 4 8 16 4 active sync /dev/sdb 5 8 96 5 active sync /dev/sdg 6 0 0 6 removed 7 8 64 7 active sync /dev/sde 8 8 0 8 active sync /dev/sda 9 8 112 9 active sync /dev/sdh root@srv:~# Subsequent investigation has shown sdd has a pending reallocation and I can only assume the unidentified IO error was as a result of tripping up on that. It still does not explain why all IO to the array froze after the drive was kicked. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: What the heck happened to my array? (No apparent data loss). 2011-04-04 5:59 ` Brad Campbell @ 2011-04-04 16:49 ` Roberto Spadim 2011-04-05 0:47 ` What the heck happened to my array? Brad Campbell 0 siblings, 1 reply; 12+ messages in thread From: Roberto Spadim @ 2011-04-04 16:49 UTC (permalink / raw) To: Brad Campbell; +Cc: linux-raid i don´t know but this happened with me on a hp server, with linux 2,6,37 i changed kernel to a older release and the problem ended, check with neil and others md guys what´s the real problem maybe realtime module and others changes inside kernel are the problem, maybe not... just a quick solution idea: try a older kernel 2011/4/4 Brad Campbell <lists2009@fnarfbargle.com>: > On 03/04/11 23:47, Roberto Spadim wrote: >> >> what kernel version? more informations about your linux box? > > The kernel version and architecture were the first 2 lines of the E-mail you > top posted over. > > What would you like to know about the box? It's a 6 core Phenom-II with 16G > of ram. 2 LSI SAS 9240 controllers configured with 10 x 1TB SATA Drives in a > RAID-6(md0) & 3 x 750GB SATA drives in a RAID-5(md2). > > The boot drives are a pair of 1TB SATA drives in multiple RAID-1's using the > on-board AMD chipset controller and there is a 64GB SSD on a separate PCI-E > Marvell 7042m Controller. > > The array in question is : > > root@srv:~# mdadm --detail /dev/md0 > /dev/md0: > Version : 1.2 > Creation Time : Sat Jan 8 11:25:17 2011 > Raid Level : raid6 > Array Size : 7814078464 (7452.09 GiB 8001.62 GB) > Used Dev Size : 976759808 (931.51 GiB 1000.20 GB) > Raid Devices : 10 > Total Devices : 9 > Persistence : Superblock is persistent > > Update Time : Mon Apr 4 13:53:59 2011 > State : clean, degraded, recovering > Active Devices : 9 > Working Devices : 9 > Failed Devices : 0 > Spare Devices : 0 > > Layout : left-symmetric > Chunk Size : 512K > > Reshape Status : 29% complete > New Chunksize : 64K > > Name : srv:server (local to host srv) > UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e > Events : 429198 > > Number Major Minor RaidDevice State > 0 8 32 0 active sync /dev/sdc > 1 8 176 1 active sync /dev/sdl > 2 8 192 2 active sync /dev/sdm > 3 8 80 3 active sync /dev/sdf > 4 8 16 4 active sync /dev/sdb > 5 8 96 5 active sync /dev/sdg > 6 0 0 6 removed > 7 8 64 7 active sync /dev/sde > 8 8 0 8 active sync /dev/sda > 9 8 112 9 active sync /dev/sdh > root@srv:~# > > Subsequent investigation has shown sdd has a pending reallocation and I can > only assume the unidentified IO error was as a result of tripping up on > that. It still does not explain why all IO to the array froze after the > drive was kicked. > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Roberto Spadim Spadim Technology / SPAEmpresarial -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: What the heck happened to my array? 2011-04-04 16:49 ` Roberto Spadim @ 2011-04-05 0:47 ` Brad Campbell 2011-04-05 6:10 ` NeilBrown 0 siblings, 1 reply; 12+ messages in thread From: Brad Campbell @ 2011-04-05 0:47 UTC (permalink / raw) To: linux-raid; +Cc: neilb On 05/04/11 00:49, Roberto Spadim wrote: > i don´t know but this happened with me on a hp server, with linux > 2,6,37 i changed kernel to a older release and the problem ended, > check with neil and others md guys what´s the real problem > maybe realtime module and others changes inside kernel are the > problem, maybe not... > just a quick solution idea: try a older kernel > Quick precis: - Started reshape 512k to 64k chunk size. - sdd got bad sector and was kicked. - Array froze all IO. - Reboot required to get system back. - Restarted reshape with 9 drives. - sdl suffered IO error and was kicked - Array froze all IO. - Reboot required to get system back. - Array will no longer mount with 8/10 drives. - Mdadm 3.1.5 segfaults when trying to start reshape. Naively tried to run it under gdb to get a backtrace but was unable to stop it forking - Got array started with mdadm 3.2.1 - Attempted to re-add sdd/sdl (now marked as spares) root@srv:~/mdadm-3.1.5# cat /proc/mdstat Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] md0 : active raid6 sdl[1](S) sdd[6](S) sdc[0] sdh[9] sda[8] sde[7] sdg[5] sdb[4] sdf[3] sdm[2] 7814078464 blocks super 1.2 level 6, 512k chunk, algorithm 2 [10/8] [U_UUUU_UUU] resync=DELAYED md2 : active raid5 sdi[0] sdk[3] sdj[1] 1465146368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU] md6 : active raid1 sdo6[0] sdn6[1] 821539904 blocks [2/2] [UU] md5 : active raid1 sdo5[0] sdn5[1] 104864192 blocks [2/2] [UU] md4 : active raid1 sdo3[0] sdn3[1] 20980800 blocks [2/2] [UU] md3 : active (auto-read-only) raid1 sdo2[0] sdn2[1] 8393856 blocks [2/2] [UU] md1 : active raid1 sdo1[0] sdn1[1] 20980736 blocks [2/2] [UU] unused devices: <none> [ 303.640776] md: bind<sdl> [ 303.677461] md: bind<sdm> [ 303.837358] md: bind<sdf> [ 303.846291] md: bind<sdb> [ 303.851476] md: bind<sdg> [ 303.860725] md: bind<sdd> [ 303.861055] md: bind<sde> [ 303.861982] md: bind<sda> [ 303.862830] md: bind<sdh> [ 303.863128] md: bind<sdc> [ 303.863306] md: kicking non-fresh sdd from array! [ 303.863353] md: unbind<sdd> [ 303.900207] md: export_rdev(sdd) [ 303.900260] md: kicking non-fresh sdl from array! 
[ 303.900306] md: unbind<sdl> [ 303.940100] md: export_rdev(sdl) [ 303.942181] md/raid:md0: reshape will continue [ 303.942242] md/raid:md0: device sdc operational as raid disk 0 [ 303.942285] md/raid:md0: device sdh operational as raid disk 9 [ 303.942327] md/raid:md0: device sda operational as raid disk 8 [ 303.942368] md/raid:md0: device sde operational as raid disk 7 [ 303.942409] md/raid:md0: device sdg operational as raid disk 5 [ 303.942449] md/raid:md0: device sdb operational as raid disk 4 [ 303.942490] md/raid:md0: device sdf operational as raid disk 3 [ 303.942531] md/raid:md0: device sdm operational as raid disk 2 [ 303.943733] md/raid:md0: allocated 10572kB [ 303.943866] md/raid:md0: raid level 6 active with 8 out of 10 devices, algorithm 2 [ 303.943912] RAID conf printout: [ 303.943916] --- level:6 rd:10 wd:8 [ 303.943920] disk 0, o:1, dev:sdc [ 303.943924] disk 2, o:1, dev:sdm [ 303.943927] disk 3, o:1, dev:sdf [ 303.943931] disk 4, o:1, dev:sdb [ 303.943934] disk 5, o:1, dev:sdg [ 303.943938] disk 7, o:1, dev:sde [ 303.943941] disk 8, o:1, dev:sda [ 303.943945] disk 9, o:1, dev:sdh [ 303.944061] md0: detected capacity change from 0 to 8001616347136 [ 303.944366] md: md0 switched to read-write mode. [ 303.944427] md: reshape of RAID array md0 [ 303.944469] md: minimum _guaranteed_ speed: 1000 KB/sec/disk. [ 303.944511] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape. [ 303.944573] md: using 128k window, over a total of 976759808 blocks. [ 304.054875] md0: unknown partition table [ 304.393245] mdadm[5940]: segfault at 7f2000 ip 00000000004480d2 sp 00007fffa04777b8 error 4 in mdadm[400000+64000] root@srv:~# mdadm --detail /dev/md0 /dev/md0: Version : 1.2 Creation Time : Sat Jan 8 11:25:17 2011 Raid Level : raid6 Array Size : 7814078464 (7452.09 GiB 8001.62 GB) Used Dev Size : 976759808 (931.51 GiB 1000.20 GB) Raid Devices : 10 Total Devices : 10 Persistence : Superblock is persistent Update Time : Tue Apr 5 07:54:30 2011 State : active, degraded Active Devices : 8 Working Devices : 10 Failed Devices : 0 Spare Devices : 2 Layout : left-symmetric Chunk Size : 512K New Chunksize : 64K Name : srv:server (local to host srv) UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e Events : 633835 Number Major Minor RaidDevice State 0 8 32 0 active sync /dev/sdc 1 0 0 1 removed 2 8 192 2 active sync /dev/sdm 3 8 80 3 active sync /dev/sdf 4 8 16 4 active sync /dev/sdb 5 8 96 5 active sync /dev/sdg 6 0 0 6 removed 7 8 64 7 active sync /dev/sde 8 8 0 8 active sync /dev/sda 9 8 112 9 active sync /dev/sdh 1 8 176 - spare /dev/sdl 6 8 48 - spare /dev/sdd root@srv:~# for i in /dev/sd? ; do mdadm --examine $i ; done /dev/sda: Magic : a92b4efc Version : 1.2 Feature Map : 0x4 Array UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e Name : srv:server (local to host srv) Creation Time : Sat Jan 8 11:25:17 2011 Raid Level : raid6 Raid Devices : 10 Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB) Array Size : 15628156928 (7452.09 GiB 8001.62 GB) Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : active Device UUID : 9beb9a0f:2a73328c:f0c17909:89da70fd Reshape pos'n : 3437035520 (3277.81 GiB 3519.52 GB) New Chunksize : 64K Update Time : Tue Apr 5 07:54:30 2011 Checksum : c58ed095 - correct Events : 633835 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 8 Array State : A.AAAA.AAA ('A' == active, '.' 
== missing) /dev/sdb: Magic : a92b4efc Version : 1.2 Feature Map : 0x4 Array UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e Name : srv:server (local to host srv) Creation Time : Sat Jan 8 11:25:17 2011 Raid Level : raid6 Raid Devices : 10 Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB) Array Size : 15628156928 (7452.09 GiB 8001.62 GB) Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : active Device UUID : 75d997f8:d9372d90:c068755b:81c8206b Reshape pos'n : 3437035520 (3277.81 GiB 3519.52 GB) New Chunksize : 64K Update Time : Tue Apr 5 07:54:30 2011 Checksum : 72321703 - correct Events : 633835 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 4 Array State : A.AAAA.AAA ('A' == active, '.' == missing) /dev/sdc: Magic : a92b4efc Version : 1.2 Feature Map : 0x4 Array UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e Name : srv:server (local to host srv) Creation Time : Sat Jan 8 11:25:17 2011 Raid Level : raid6 Raid Devices : 10 Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB) Array Size : 15628156928 (7452.09 GiB 8001.62 GB) Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : active Device UUID : 5738a232:85f23a16:0c7a9454:d770199c Reshape pos'n : 3437035520 (3277.81 GiB 3519.52 GB) New Chunksize : 64K Update Time : Tue Apr 5 07:54:30 2011 Checksum : 5c61ea2e - correct Events : 633835 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 0 Array State : A.AAAA.AAA ('A' == active, '.' == missing) /dev/sdd: Magic : a92b4efc Version : 1.2 Feature Map : 0x4 Array UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e Name : srv:server (local to host srv) Creation Time : Sat Jan 8 11:25:17 2011 Raid Level : raid6 Raid Devices : 10 Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB) Array Size : 15628156928 (7452.09 GiB 8001.62 GB) Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : active Device UUID : 83a2c731:ba2846d0:2ce97d83:de624339 Reshape pos'n : 3437035520 (3277.81 GiB 3519.52 GB) New Chunksize : 64K Update Time : Tue Apr 5 07:54:30 2011 Checksum : e1a5ebbc - correct Events : 633835 Layout : left-symmetric Chunk Size : 512K Device Role : spare Array State : A.AAAA.AAA ('A' == active, '.' == missing) /dev/sde: Magic : a92b4efc Version : 1.2 Feature Map : 0x4 Array UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e Name : srv:server (local to host srv) Creation Time : Sat Jan 8 11:25:17 2011 Raid Level : raid6 Raid Devices : 10 Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB) Array Size : 15628156928 (7452.09 GiB 8001.62 GB) Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : active Device UUID : f1e3c1d3:ea9dc52e:a4e6b70e:e25a0321 Reshape pos'n : 3437035520 (3277.81 GiB 3519.52 GB) New Chunksize : 64K Update Time : Tue Apr 5 07:54:30 2011 Checksum : 551997d7 - correct Events : 633835 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 7 Array State : A.AAAA.AAA ('A' == active, '.' 
== missing) /dev/sdf: Magic : a92b4efc Version : 1.2 Feature Map : 0x4 Array UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e Name : srv:server (local to host srv) Creation Time : Sat Jan 8 11:25:17 2011 Raid Level : raid6 Raid Devices : 10 Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB) Array Size : 15628156928 (7452.09 GiB 8001.62 GB) Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : active Device UUID : c32dff71:0b8c165c:9f589b0f:bcbc82da Reshape pos'n : 3437035520 (3277.81 GiB 3519.52 GB) New Chunksize : 64K Update Time : Tue Apr 5 07:54:30 2011 Checksum : db0aa39b - correct Events : 633835 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 3 Array State : A.AAAA.AAA ('A' == active, '.' == missing) /dev/sdg: Magic : a92b4efc Version : 1.2 Feature Map : 0x4 Array UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e Name : srv:server (local to host srv) Creation Time : Sat Jan 8 11:25:17 2011 Raid Level : raid6 Raid Devices : 10 Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB) Array Size : 15628156928 (7452.09 GiB 8001.62 GB) Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : active Device UUID : 194bc75c:97d3f507:4915b73a:51a50172 Reshape pos'n : 3437035520 (3277.81 GiB 3519.52 GB) New Chunksize : 64K Update Time : Tue Apr 5 07:54:30 2011 Checksum : 344cadbe - correct Events : 633835 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 5 Array State : A.AAAA.AAA ('A' == active, '.' == missing) /dev/sdh: Magic : a92b4efc Version : 1.2 Feature Map : 0x4 Array UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e Name : srv:server (local to host srv) Creation Time : Sat Jan 8 11:25:17 2011 Raid Level : raid6 Raid Devices : 10 Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB) Array Size : 15628156928 (7452.09 GiB 8001.62 GB) Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : active Device UUID : 1326457e:4fc0a6be:0073ccae:398d5c7f Reshape pos'n : 3437035520 (3277.81 GiB 3519.52 GB) New Chunksize : 64K Update Time : Tue Apr 5 07:54:30 2011 Checksum : 8debbb14 - correct Events : 633835 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 9 Array State : A.AAAA.AAA ('A' == active, '.' == missing) /dev/sdi: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : e39d73c3:75be3b52:44d195da:b240c146 Name : srv:2 (local to host srv) Creation Time : Sat Jul 10 21:14:29 2010 Raid Level : raid5 Raid Devices : 3 Avail Dev Size : 1465147120 (698.64 GiB 750.16 GB) Array Size : 2930292736 (1397.27 GiB 1500.31 GB) Used Dev Size : 1465146368 (698.64 GiB 750.15 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : clean Device UUID : b577b308:56f2e4c9:c78175f4:cf10c77f Update Time : Tue Apr 5 07:46:18 2011 Checksum : 57ee683f - correct Events : 455775 Layout : left-symmetric Chunk Size : 64K Device Role : Active device 0 Array State : AAA ('A' == active, '.' 
== missing) /dev/sdj: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : e39d73c3:75be3b52:44d195da:b240c146 Name : srv:2 (local to host srv) Creation Time : Sat Jul 10 21:14:29 2010 Raid Level : raid5 Raid Devices : 3 Avail Dev Size : 1465147120 (698.64 GiB 750.16 GB) Array Size : 2930292736 (1397.27 GiB 1500.31 GB) Used Dev Size : 1465146368 (698.64 GiB 750.15 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : clean Device UUID : b127f002:a4aa8800:735ef8d7:6018564e Update Time : Tue Apr 5 07:46:18 2011 Checksum : 3ae0b4c6 - correct Events : 455775 Layout : left-symmetric Chunk Size : 64K Device Role : Active device 1 Array State : AAA ('A' == active, '.' == missing) /dev/sdk: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : e39d73c3:75be3b52:44d195da:b240c146 Name : srv:2 (local to host srv) Creation Time : Sat Jul 10 21:14:29 2010 Raid Level : raid5 Raid Devices : 3 Avail Dev Size : 1465147120 (698.64 GiB 750.16 GB) Array Size : 2930292736 (1397.27 GiB 1500.31 GB) Used Dev Size : 1465146368 (698.64 GiB 750.15 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : clean Device UUID : 90fddf63:03d5dba4:3fcdc476:9ce3c44c Update Time : Tue Apr 5 07:46:18 2011 Checksum : dd5eef0e - correct Events : 455775 Layout : left-symmetric Chunk Size : 64K Device Role : Active device 2 Array State : AAA ('A' == active, '.' == missing) /dev/sdl: Magic : a92b4efc Version : 1.2 Feature Map : 0x4 Array UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e Name : srv:server (local to host srv) Creation Time : Sat Jan 8 11:25:17 2011 Raid Level : raid6 Raid Devices : 10 Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB) Array Size : 15628156928 (7452.09 GiB 8001.62 GB) Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : active Device UUID : 769940af:66733069:37cea27d:7fb28a23 Reshape pos'n : 3437035520 (3277.81 GiB 3519.52 GB) New Chunksize : 64K Update Time : Tue Apr 5 07:54:30 2011 Checksum : dc756202 - correct Events : 633835 Layout : left-symmetric Chunk Size : 512K Device Role : spare Array State : A.AAAA.AAA ('A' == active, '.' == missing) /dev/sdm: Magic : a92b4efc Version : 1.2 Feature Map : 0x4 Array UUID : d00a11d7:fe0435af:07c8d4d6:e3b8e34e Name : srv:server (local to host srv) Creation Time : Sat Jan 8 11:25:17 2011 Raid Level : raid6 Raid Devices : 10 Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB) Array Size : 15628156928 (7452.09 GiB 8001.62 GB) Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : active Device UUID : 7e564e2c:7f21125b:c3b1907a:b640178f Reshape pos'n : 3437035520 (3277.81 GiB 3519.52 GB) New Chunksize : 64K Update Time : Tue Apr 5 07:54:30 2011 Checksum : b3df3ee7 - correct Events : 633835 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 2 Array State : A.AAAA.AAA ('A' == active, '.' == missing) root@srv:~/mdadm-3.1.5# ./mdadm --version mdadm - v3.1.5 - 23rd March 2011 root@srv:~/mdadm-3.1.5# uname -a Linux srv 2.6.38 #19 SMP Wed Mar 23 09:57:05 WST 2011 x86_64 GNU/Linux Now. The array restarted with mdadm 3.2.1, but of course its now reshaping 8 out of 10 disks, has no redundancy and is going at 600k/s which will take over 10 days. Is there anything I can do to give it some redundancy while it completes or am I better to copy the data off, blow it away and start again? All the important stuff is backed up anyway, I just wanted to avoid restoring 8TB from backup if I could. 
Regards, Brad -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: What the heck happened to my array? 2011-04-05 0:47 ` What the heck happened to my array? Brad Campbell @ 2011-04-05 6:10 ` NeilBrown 2011-04-05 9:02 ` Brad Campbell 2011-04-08 1:19 ` Brad Campbell 0 siblings, 2 replies; 12+ messages in thread From: NeilBrown @ 2011-04-05 6:10 UTC (permalink / raw) To: Brad Campbell; +Cc: linux-raid On Tue, 05 Apr 2011 08:47:16 +0800 Brad Campbell <lists2009@fnarfbargle.com> wrote: > On 05/04/11 00:49, Roberto Spadim wrote: > > i don´t know but this happened with me on a hp server, with linux > > 2,6,37 i changed kernel to a older release and the problem ended, > > check with neil and others md guys what´s the real problem > > maybe realtime module and others changes inside kernel are the > > problem, maybe not... > > just a quick solution idea: try a older kernel > > > > Quick precis: > - Started reshape 512k to 64k chunk size. > - sdd got bad sector and was kicked. > - Array froze all IO. That .... shouldn't happen. But I know why it did. mdadm forks and runs in the background monitoring the reshape. It suspends IO to a region of the array, backs up the data, then lets the reshape progress over that region, then invalidates the backup and allows IO to resume, then moves on to the next region (it actually has two regions in different states at the same time, but you get the idea). If the device failed, the reshape in the kernel aborted and then restarted. It is meant to do this - restore to a known state, then decide if there is anything useful to do. It restarts exactly where it left off so all should be fine. mdadm periodically checks the value in 'sync_completed' to see how far the reshape has progressed to know if it can move on. If it checks while the reshape is temporarily aborted it sees 'none', which is not a number, so it aborts. That should be fixed. It aborts with IO to a region still suspended, so it is very possible for IO to freeze if anything is destined for that region. > - Reboot required to get system back. > - Restarted reshape with 9 drives. > - sdl suffered IO error and was kicked Very sad. > - Array froze all IO. Same thing... > - Reboot required to get system back. > - Array will no longer mount with 8/10 drives. > - Mdadm 3.1.5 segfaults when trying to start reshape. Don't know why it would have done that... I cannot reproduce it easily. > Naively tried to run it under gdb to get a backtrace but was unable > to stop it forking Yes, tricky .... an "strace -o /tmp/file -f mdadm ...." might have been enough, but too late to worry about that now. > - Got array started with mdadm 3.2.1 > - Attempted to re-add sdd/sdl (now marked as spares) Hmm... it isn't meant to do that any more. I thought I fixed it so that if a device looked like part of the array it wouldn't add it as a spare... Obviously that didn't work. I'd better look into it again. > [ 304.393245] mdadm[5940]: segfault at 7f2000 ip 00000000004480d2 sp > 00007fffa04777b8 error 4 in mdadm[400000+64000] > If you have the exact mdadm binary that caused this segfault we should be able to figure out what instruction was at 0004480d2. If you don't feel up to it, could you please email me the file privately and I'll have a look. > root@srv:~/mdadm-3.1.5# uname -a > Linux srv 2.6.38 #19 SMP Wed Mar 23 09:57:05 WST 2011 x86_64 GNU/Linux > > Now. The array restarted with mdadm 3.2.1, but of course its now > reshaping 8 out of 10 disks, has no redundancy and is going at 600k/s > which will take over 10 days. 
Is there anything I can do to give it some > redundancy while it completes or am I better to copy the data off, blow > it away and start again? All the important stuff is backed up anyway, I > just wanted to avoid restoring 8TB from backup if I could. No, you cannot give it extra redundancy. I would suggest: copy anything that you need off, just in case - if you can. Kill the mdadm that is running in the background. This will mean that if the machine crashes your array will be corrupted, but you are thinking of rebuilding it anyway, so that isn't the end of the world. In /sys/block/md0/md cat suspend_hi > suspend_lo cat component_size > sync_max That will allow the reshape to continue without any backup. It will be much faster (but less safe, as I said). If the reshape completes without incident, it will start recovering to the two 'spares' - and then you will have a happy array again. If something goes wrong, you will need to scrap the array, recreate it, and copy data back from wherever you copied it to (or backups). If anything there doesn't make sense, or doesn't seem to work - please ask. Thanks for the report. I'll try to get those mdadm issues addressed - particularly if you can get me the mdadm file which caused the segfault. NeilBrown -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 12+ messages in thread
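Neil's description of the monitoring loop can be made concrete with a small sketch. This is not mdadm's actual code, just a shell rendering of the pitfall he describes: sync_completed normally reads as "done / total" in sectors, but briefly reads "none" while the reshape is being aborted and restarted, and a monitor that insists on a number at that moment gives up with the suspended region still in place.

  # Illustrative only (md0 assumed); mdadm does the equivalent in C.
  while sleep 5; do
      c=$(cat /sys/block/md0/md/sync_completed)  # e.g. "1723392000 / 1953519616", or "none"
      if [ "$c" = "none" ]; then
          # Reshape momentarily stopped (e.g. after a device failure);
          # the robust behaviour is to wait and re-check, not to bail out.
          continue
      fi
      echo "reshape progress: $c"
  done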
* Re: What the heck happened to my array? 2011-04-05 6:10 ` NeilBrown @ 2011-04-05 9:02 ` Brad Campbell 2011-04-05 11:31 ` NeilBrown 2011-04-08 1:19 ` Brad Campbell 1 sibling, 1 reply; 12+ messages in thread From: Brad Campbell @ 2011-04-05 9:02 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid On 05/04/11 14:10, NeilBrown wrote: >> - Reboot required to get system back. >> - Restarted reshape with 9 drives. >> - sdl suffered IO error and was kicked > > Very sad. I'd say pretty damn unlucky actually. >> - Array froze all IO. > > Same thing... > >> - Reboot required to get system back. >> - Array will no longer mount with 8/10 drives. >> - Mdadm 3.1.5 segfaults when trying to start reshape. > > Don't know why it would have done that... I cannot reproduce it easily. No. I tried numerous incantations. The system version of mdadm is Debian 3.1.4. This segfaulted so I downloaded and compiled 3.1.5 which did the same thing. I then composed most of this E-mail, made *really* sure my backups were up to date and tried 3.2.1 which to my astonishment worked. It's been ticking along _slowly_ ever since. >> Naively tried to run it under gdb to get a backtrace but was unable >> to stop it forking > > Yes, tricky .... an "strace -o /tmp/file -f mdadm ...." might have been > enough, but to late to worry about that now. I wondered about using strace but for some reason got it into my head that a gdb backtrace would be more useful. Then of course I got it started with 3.2.1 and have not tried again. >> - Got array started with mdadm 3.2.1 >> - Attempted to re-add sdd/sdl (now marked as spares) > > Hmm... it isn't meant to do that any more. I thought I fixed it so that it > if a device looked like part of the array it wouldn't add it as a spare... > Obviously that didn't work. I'd better look in to it again. Now the chain of events that led up to this was along these lines. - Rebooted machine. - Tried to --assemble with 3.1.4 - mdadm told me it did not really want to continue with 8/10 devices and I should use --force if I really wanted it to try. - I used --force - I did a mdadm --add /dev/md0 /dev/sdd and the same for sdl - I checked and they were listed as spares. So this was all done with Debian's mdadm 3.1.4, *not* 3.1.5 > > No, you cannot give it extra redundancy. > I would suggest: > copy anything that you need off, just in case - if you can. > > Kill the mdadm that is running in the back ground. This will mean that > if the machine crashes your array will be corrupted, but you are thinking > of rebuilding it any, so that isn't the end of the world. > In /sys/block/md0/md > cat suspend_hi> suspend_lo > cat component_size> sync_max > > That will allow the reshape to continue without any backup. It will be > much faster (but less safe, as I said). Well, I have nothing to lose, but I've just picked up some extra drives so I'll make second backups and then give this a whirl. > If something goes wrong, you will need to scrap the array, recreate it, and > copy data back from where-ever you copied it to (or backups). I did go into this with the niggling feeling that something bad might happen, so I made sure all my backups were up to date before I started. No biggie if it does die. The very odd thing is I did a complete array check, plus SMART long tests on all drives literally hours before I started the reshape. Goes to show how ropey these large drives can be in big(iash) arrays. > If anything there doesn't make sense, or doesn't seem to work - please ask. > > Thanks for the report. 
I'll try to get those mdadm issues addressed - > particularly if you can get me the mdadm file which caused the segfault. > Well, luckily I preserved the entire build tree then. I was planning on running nm over the binary and have a two thumbs type of look into it with gdb, but seeing as you probably have a much better idea what you are looking for I'll just send you the binary! Thanks for the help Neil. Much appreciated. ^ permalink raw reply [flat|nested] 12+ messages in thread
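A side note on the --add step Brad describes above, hedged because the thread did not verify it: mdadm draws a distinction between --re-add, which asks md to put a recently failed member back into its old slot based on the metadata already on the disk (and a write-intent bitmap, if one exists), and --add, which here ended up treating the disks as brand-new spares. Whether --re-add would have succeeded on sdd/sdl after the reshape crashes depends on their event counts, so this only illustrates the distinction:

  # Illustration only -- device name taken from this thread.
  mdadm /dev/md0 --re-add /dev/sdd   # try to restore the disk to its former role
  mdadm /dev/md0 --add /dev/sdd      # otherwise the disk comes back as a spare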
* Re: What the heck happened to my array? 2011-04-05 9:02 ` Brad Campbell @ 2011-04-05 11:31 ` NeilBrown 2011-04-05 11:47 ` Brad Campbell 0 siblings, 1 reply; 12+ messages in thread From: NeilBrown @ 2011-04-05 11:31 UTC (permalink / raw) To: Brad Campbell; +Cc: linux-raid On Tue, 05 Apr 2011 17:02:43 +0800 Brad Campbell <lists2009@fnarfbargle.com> wrote: > Well, luckily I preserved the entire build tree then. I was planning on > running nm over the binary and have a two thumbs type of look into it > with gdb, but seeing as you probably have a much better idea what you > are looking for I'll just send you the binary! Thanks. It took me a little while, but I've found the problem. The code was failing at wd0 = sources[z][d]; in qsyndrome in restripe.c. It is looking up 'd' in 'sources[z]' and having problems. The error address (from dmesg) is 0x7f2000 so it isn't a NULL pointer, but rather it is falling off the end of an allocation. When doing qsyndrome calculations we often need a block full of zeros, so restripe.c allocates one and stores it in in a global pointer. You were restriping from 512K to 64K chunk size. The first thing restripe.c was called on to do was to restore data from the backup file into the array. This uses the new chunk size - 64K. So the 'zero' buffer was allocated at 64K and cleared. The next thing it does is read the next section of the array and write it to the backup. As the array was missing 2 devices it needed to do a qsyndrome calculation to get the missing data block(s). This was a calculation done on old-style chunks so it needed a 512K zero block. However as a zero block had already been allocated it didn't bother to allocate another one. It just used what it had, which was too small. So it fell off the end and got the result we saw. I don't know why this works in 3.2.1 where it didn't work in 3.1.4. However when it successfully recovers from the backup it should update the metadata so that it knows it has successfully recovered and doesn't need to recover any more. So maybe the time it worked, it found there wasn't any recovery needed and so didn't allocate a 'zero' buffer until it was working with the old, bigger, chunk size. Anyway, this is easy to fix which I will do. It only affects restarting a reshape of a double-degraded RAID6 which reduced the chunksize. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: What the heck happened to my array? 2011-04-05 11:31 ` NeilBrown @ 2011-04-05 11:47 ` Brad Campbell 0 siblings, 0 replies; 12+ messages in thread From: Brad Campbell @ 2011-04-05 11:47 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid On 05/04/11 19:31, NeilBrown wrote: > It only affects restarting a reshape of a double-degraded RAID6 which reduced > the chunksize. Talk about the planets falling into alignment! I was obviously holding my mouth wrong when I pushed the enter key. Thanks for the explanation. Much appreciated (again!) ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: What the heck happened to my array? 2011-04-05 6:10 ` NeilBrown 2011-04-05 9:02 ` Brad Campbell @ 2011-04-08 1:19 ` Brad Campbell 2011-04-08 9:52 ` NeilBrown 1 sibling, 1 reply; 12+ messages in thread From: Brad Campbell @ 2011-04-08 1:19 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid On 05/04/11 14:10, NeilBrown wrote: > I would suggest: > copy anything that you need off, just in case - if you can. > > Kill the mdadm that is running in the back ground. This will mean that > if the machine crashes your array will be corrupted, but you are thinking > of rebuilding it any, so that isn't the end of the world. > In /sys/block/md0/md > cat suspend_hi> suspend_lo > cat component_size> sync_max > root@srv:/sys/block/md0/md# cat /proc/mdstat Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] md0 : active raid6 sdc[0] sdd[6](S) sdl[1](S) sdh[9] sda[8] sde[7] sdg[5] sdb[4] sdf[3] sdm[2] 7814078464 blocks super 1.2 level 6, 512k chunk, algorithm 2 [10/8] [U_UUUU_UUU] [=================>...] reshape = 88.2% (861696000/976759808) finish=3713.3min speed=516K/sec md2 : active raid5 sdi[0] sdk[3] sdj[1] 1465146368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU] md6 : active raid1 sdp6[0] sdo6[1] 821539904 blocks [2/2] [UU] md5 : active raid1 sdp5[0] sdo5[1] 104864192 blocks [2/2] [UU] md4 : active raid1 sdp3[0] sdo3[1] 20980800 blocks [2/2] [UU] md3 : active raid1 sdp2[0] sdo2[1] 8393856 blocks [2/2] [UU] md1 : active raid1 sdp1[0] sdo1[1] 20980736 blocks [2/2] [UU] unused devices: <none> root@srv:/sys/block/md0/md# cat component_size > sync_max cat: write error: Device or resource busy root@srv:/sys/block/md0/md# cat suspend_hi suspend_lo 13788774400 13788774400 root@srv:/sys/block/md0/md# grep . sync_* sync_action:reshape sync_completed:1723392000 / 1953519616 sync_force_parallel:0 sync_max:1723392000 sync_min:0 sync_speed:281 sync_speed_max:200000 (system) sync_speed_min:200000 (local) So I killed mdadm, then did the cat suspend_hi > suspend_lo.. but as you can see it won't let me change sync_max. The array above reports 516K/sec, but that was just on its way down to 0 on a time based average. It was not moving at all. I then tried stopping the array, restarting it with mdadm 3.1.4 which immediately segfaulted and left the array in state resync=DELAYED. I issued the above commands again, which succeeded this time but while the array looked good, it was not resyncing : root@srv:/sys/block/md0/md# cat /proc/mdstat Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] md0 : active raid6 sdc[0] sdd[6](S) sdl[1](S) sdh[9] sda[8] sde[7] sdg[5] sdb[4] sdf[3] sdm[2] 7814078464 blocks super 1.2 level 6, 512k chunk, algorithm 2 [10/8] [U_UUUU_UUU] [=================>...] reshape = 88.2% (861698048/976759808) finish=30203712.0min speed=0K/sec md2 : active raid5 sdi[0] sdk[3] sdj[1] 1465146368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU] md6 : active raid1 sdp6[0] sdo6[1] 821539904 blocks [2/2] [UU] md5 : active raid1 sdp5[0] sdo5[1] 104864192 blocks [2/2] [UU] md4 : active raid1 sdp3[0] sdo3[1] 20980800 blocks [2/2] [UU] md3 : active raid1 sdp2[0] sdo2[1] 8393856 blocks [2/2] [UU] md1 : active raid1 sdp1[0] sdo1[1] 20980736 blocks [2/2] [UU] unused devices: <none> root@srv:/sys/block/md0/md# grep . 
sync* sync_action:reshape sync_completed:1723396096 / 1953519616 sync_force_parallel:0 sync_max:976759808 sync_min:0 sync_speed:0 sync_speed_max:200000 (system) sync_speed_min:200000 (local) I stopped the array and restarted it with mdadm 3.2.1 and it continued along its merry way. Not an issue, and I don't much care if it blew something up, but I thought it worthy of a follow up. If there is anything you need tested while it's in this state I've got ~ 1000 minutes of resync time left and I'm happy to damage it if requested. Regards, Brad ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: What the heck happened to my array?
  2011-04-08  1:19 ` Brad Campbell
@ 2011-04-08  9:52   ` NeilBrown
  2011-04-08 15:27     ` Roberto Spadim
  0 siblings, 1 reply; 12+ messages in thread

From: NeilBrown @ 2011-04-08 9:52 UTC (permalink / raw)
To: Brad Campbell; +Cc: linux-raid

On Fri, 08 Apr 2011 09:19:01 +0800 Brad Campbell <lists2009@fnarfbargle.com> wrote:

> On 05/04/11 14:10, NeilBrown wrote:
>
> > I would suggest:
> >   copy anything that you need off, just in case - if you can.
> >
> >   Kill the mdadm that is running in the background.  This will mean that
> >   if the machine crashes your array will be corrupted, but you are thinking
> >   of rebuilding it anyway, so that isn't the end of the world.
> >   In /sys/block/md0/md
> >     cat suspend_hi > suspend_lo
> >     cat component_size > sync_max
>
> root@srv:/sys/block/md0/md# cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
> md0 : active raid6 sdc[0] sdd[6](S) sdl[1](S) sdh[9] sda[8] sde[7] sdg[5] sdb[4] sdf[3] sdm[2]
>       7814078464 blocks super 1.2 level 6, 512k chunk, algorithm 2 [10/8] [U_UUUU_UUU]
>       [=================>...]  reshape = 88.2% (861696000/976759808) finish=3713.3min speed=516K/sec
>
> md2 : active raid5 sdi[0] sdk[3] sdj[1]
>       1465146368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]
>
> md6 : active raid1 sdp6[0] sdo6[1]
>       821539904 blocks [2/2] [UU]
>
> md5 : active raid1 sdp5[0] sdo5[1]
>       104864192 blocks [2/2] [UU]
>
> md4 : active raid1 sdp3[0] sdo3[1]
>       20980800 blocks [2/2] [UU]
>
> md3 : active raid1 sdp2[0] sdo2[1]
>       8393856 blocks [2/2] [UU]
>
> md1 : active raid1 sdp1[0] sdo1[1]
>       20980736 blocks [2/2] [UU]
>
> unused devices: <none>
>
> root@srv:/sys/block/md0/md# cat component_size > sync_max
> cat: write error: Device or resource busy

Sorry, I should have checked the source code.

    echo max > sync_max

is what you want.  Or just a much bigger number.

> root@srv:/sys/block/md0/md# cat suspend_hi suspend_lo
> 13788774400
> 13788774400

They are the same, so that is good - nothing will be suspended.

> root@srv:/sys/block/md0/md# grep . sync_*
> sync_action:reshape
> sync_completed:1723392000 / 1953519616
> sync_force_parallel:0
> sync_max:1723392000
> sync_min:0
> sync_speed:281
> sync_speed_max:200000 (system)
> sync_speed_min:200000 (local)
>
> So I killed mdadm, then did the cat suspend_hi > suspend_lo, but as you
> can see it won't let me change sync_max.  The array above reports
> 516K/sec, but that was just on its way down to 0 on a time-based
> average.  It was not moving at all.
>
> I then tried stopping the array and restarting it with mdadm 3.1.4, which
> immediately segfaulted and left the array in state resync=DELAYED.
>
> I issued the above commands again, which succeeded this time, but while
> the array looked good, it was not resyncing:
>
> root@srv:/sys/block/md0/md# cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
> md0 : active raid6 sdc[0] sdd[6](S) sdl[1](S) sdh[9] sda[8] sde[7] sdg[5] sdb[4] sdf[3] sdm[2]
>       7814078464 blocks super 1.2 level 6, 512k chunk, algorithm 2 [10/8] [U_UUUU_UUU]
>       [=================>...]  reshape = 88.2% (861698048/976759808) finish=30203712.0min speed=0K/sec
>
> md2 : active raid5 sdi[0] sdk[3] sdj[1]
>       1465146368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]
>
> md6 : active raid1 sdp6[0] sdo6[1]
>       821539904 blocks [2/2] [UU]
>
> md5 : active raid1 sdp5[0] sdo5[1]
>       104864192 blocks [2/2] [UU]
>
> md4 : active raid1 sdp3[0] sdo3[1]
>       20980800 blocks [2/2] [UU]
>
> md3 : active raid1 sdp2[0] sdo2[1]
>       8393856 blocks [2/2] [UU]
>
> md1 : active raid1 sdp1[0] sdo1[1]
>       20980736 blocks [2/2] [UU]
>
> unused devices: <none>
>
> root@srv:/sys/block/md0/md# grep . sync*
> sync_action:reshape
> sync_completed:1723396096 / 1953519616
> sync_force_parallel:0
> sync_max:976759808
> sync_min:0
> sync_speed:0
> sync_speed_max:200000 (system)
> sync_speed_min:200000 (local)
>
> I stopped the array and restarted it with mdadm 3.2.1 and it continued
> along its merry way.
>
> Not an issue, and I don't much care if it blew something up, but I
> thought it worthy of a follow-up.
>
> If there is anything you need tested while it's in this state I've got
> ~1000 minutes of resync time left and I'm happy to damage it if requested.

No thanks - I think I know what happened.  The main problem is that there is
confusion between 'k' and 'sectors', and there are random other values that
sometimes work (like 'max'), and I never remember which is which.  sysfs in md
is a bit of a mess... one day I hope to completely replace it (with backwards
compatibility, of course...).

Thanks for the feedback.

NeilBrown

> Regards,
> Brad
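For anyone hitting the same stall, here is a minimal sketch of the corrected sysfs
sequence described above, assuming the array is /dev/md0 and that the backgrounded
mdadm has already been killed. The unit mismatch is the point Neil is making:
component_size is reported in 1K blocks, while sync_max is taken in 512-byte sectors.

    cd /sys/block/md0/md

    # Release any region the backgrounded mdadm had suspended.
    cat suspend_hi > suspend_lo

    # Let the reshape run to the end of the array.  'max' avoids the unit
    # mismatch: "cat component_size > sync_max" writes a KiB count into a
    # sectors field, capping the reshape around the half-way point.
    echo max > sync_max

    # Equivalent with the conversion spelled out (sectors = 2 x 1K blocks):
    # echo $(( $(cat component_size) * 2 )) > sync_max

A quick grep . sync_* afterwards should show sync_speed picking up again and
sync_completed advancing toward the full sector count (1953519616 here).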
* Re: What the heck happened to my array?
  2011-04-08  9:52 ` NeilBrown
@ 2011-04-08 15:27   ` Roberto Spadim
  0 siblings, 0 replies; 12+ messages in thread

From: Roberto Spadim @ 2011-04-08 15:27 UTC (permalink / raw)
To: NeilBrown; +Cc: Brad Campbell, linux-raid

Hi Neil,

Given some time I could help with changing the sysfs interface (adding the
unit to the output, and stripping the unit text from the input).  What is
the current kernel version for that development?

2011/4/8 NeilBrown <neilb@suse.de>:
> [...]
>
> No thanks - I think I know what happened.  The main problem is that there is
> confusion between 'k' and 'sectors', and there are random other values that
> sometimes work (like 'max'), and I never remember which is which.  sysfs in md
> is a bit of a mess... one day I hope to completely replace it (with backwards
> compatibility, of course...).
>
> Thanks for the feedback.
>
> NeilBrown

--
Roberto Spadim
Spadim Technology / SPAEmpresarial
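Purely as an illustration of the suggestion above - nothing like this exists in the
md sysfs interface today, so treat every line as hypothetical - unit-aware attributes
might read back and accept values like this:

    # Hypothetical output: units printed explicitly instead of bare numbers.
    $ grep . /sys/block/md0/md/component_size /sys/block/md0/md/sync_max
    /sys/block/md0/md/component_size:976759808 KiB
    /sys/block/md0/md/sync_max:1953519616 sectors

    # Hypothetical input: the unit suffix is parsed and stripped by the kernel.
    $ echo "976759808 KiB" > /sys/block/md0/md/sync_max

That would remove exactly the 'k' versus 'sectors' ambiguity that caused the stall
earlier in the thread.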
Thread overview: 12+ messages
2011-04-03 13:32 What the heck happened to my array? (No apparent data loss) Brad Campbell
2011-04-03 15:47 ` Roberto Spadim
2011-04-04  5:59 ` Brad Campbell
2011-04-04 16:49 ` Roberto Spadim
2011-04-05  0:47 ` What the heck happened to my array? Brad Campbell
2011-04-05  6:10 ` NeilBrown
2011-04-05  9:02 ` Brad Campbell
2011-04-05 11:31 ` NeilBrown
2011-04-05 11:47 ` Brad Campbell
2011-04-08  1:19 ` Brad Campbell
2011-04-08  9:52 ` NeilBrown
2011-04-08 15:27 ` Roberto Spadim